Author: "Jun, Yan" / Search Limiters: Full Text - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Jun, Yan"' showing total 7,404 results

1. Multi-subject Open-set Personalization in Video Generation

Author: Chen, Tsai-Shien, Siarohin, Aliaksandr, Menapace, Willi, Fang, Yuwei, Lee, Kwot Sin, Skorokhodov, Ivan, Aberman, Kfir, Zhu, Jun-Yan, Yang, Ming-Hsuan, and Tulyakov, Sergey
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations., Comment: Project page: https://snap-research.github.io/open-set-video-personalization/
Published: 2025

2. Object-level Visual Prompts for Compositional Image Generation

Author: Parmar, Gaurav, Patashnik, Or, Wang, Kuan-Chieh, Ostashev, Daniil, Narasimhan, Srinivasa, Zhu, Jun-Yan, Cohen-Or, Daniel, and Aberman, Kfir
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics
Abstract: We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation., Comment: Project: https://snap-research.github.io/visual-composer/
Published: 2025
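
Illustrative note on record 2: the abstract describes a cross-attention layer whose keys and values come from two different encoders. The sketch below is a minimal, plain-Python rendering of that general idea using standard scaled dot-product attention. It is not the authors' implementation: the function name `kv_mixed_cross_attention` and the toy inputs are hypothetical, and the two encoders are assumed to have already produced one key vector and one value vector per visual-prompt token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention with keys and values from different sources.

    queries: list of d-dimensional vectors (e.g. image tokens).
    keys:    list of n d-dimensional vectors (assumed: small-bottleneck layout encoder).
    values:  list of n vectors of any width (assumed: larger appearance encoder).
    """
    if len(keys) != len(values):
        raise ValueError("each key needs exactly one matching value")
    dim = len(queries[0])
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Attention weights depend only on the key source.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The content that gets mixed in comes only from the value source.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs
```

Because the weights are computed from the keys alone, the key source decides where each prompt token is attended to, while the value source decides what appearance information is copied, which is the separation the abstract attributes to the two bottlenecks.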

3. UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Author: Jiang, Haoyu, Cheng, Zhi-Qi, Moreira, Gabriel, Zhu, Jiawen, Sun, Jingdong, Ren, Bukun, He, Jun-Yan, Dai, Qi, and Hua, Xian-Sheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels, ensuring robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, reducing adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for highly efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter consistently outperforms ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and U(d)CDR settings., Comment: Accepted to WACV 2025. Project link: https://github.com/fine68/UCDR2024
Published: 2024

4. Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation

Author: Gao, Ruihan, Deng, Kangle, Yang, Gengshan, Yuan, Wenzhen, and Zhu, Jun-Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: 3D generation methods have shown visually compelling results powered by diffusion image priors. However, they often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked in albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between two modalities of vision and touch., Comment: Accepted to NeurIPS 2024. Project webpage: https://ruihangao.github.io/TactileDreamFusion/ Code: https://github.com/RuihanGao/TactileDreamFusion
Published: 2024

5. Optical constraints on the coldest metal-poor population

Author: Zhang, Jerry Jun-Yan, Lodieu, Nicolas, Martín, Eduardo L., Osorio, María Rosa Zapatero, Béjar, Victor J. S., Ivanov, Valentin D., Boffin, Henri M. J., Shahbaz, Tariq, Pavlenko, Yakiv V., Rebolo, Rafael, Gauza, Bartosz, Sedighi, Nafise, and Quezada, Carlos
Subjects: Astrophysics - Solar and Stellar Astrophysics, Astrophysics - Astrophysics of Galaxies
Abstract: The coldest metal-poor population, made up of T and Y dwarfs, is an archaeological tracer of our Galaxy because these objects are very old and have kept their pristine material. The optical properties of these objects are important to characterise their atmospheric properties. We aim to further characterise the optical properties of the ultracool metal-poor population with deep far-red optical images and parallax determinations. We solve trigonometric parallaxes of five metal-poor T dwarf candidates using 2-year monitoring with the Calar Alto 3.5-m telescope. We obtain $z'$-band photometry for the other 12 metal-poor T dwarf candidates using the 10.4-m GTC, the 8.2-m VLT, and the DES, increasing the sample of T subdwarfs with optical photometry from 12 to 24. We report a 3-$\sigma$ limit for the Accident in five optical bands using the 10.4-m GTC. We confirm four T subdwarfs and the Accident as a Y subdwarf, and propose two more Y subdwarf candidates. We emphasise that the $z_{PS1}-W1$ colour combined with the $W1-W2$ colour could break the metallicity-temperature degeneracy for T and possibly for Y dwarfs. The $z_{PS1}-W1$ colour shifts redward when metallicity decreases at a given temperature, which is not predicted by state-of-the-art ultracool models. The Accident has the reddest $z_{PS1}-W1$ colour in our sample. The $z_{PS1}-W1$ colour will be useful to search for other examples of this cold and old population in upcoming and existing deep optical and infrared large-area surveys., Comment: 17 pages, 7 figures, 2 appendices, submitted to A&A, comments are welcome
Published: 2024

6. GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts

Author: He, Junwen, Wang, Yifan, Wang, Lijun, Lu, Huchuan, He, Jun-Yan, Li, Chenyang, Chen, Hanyuan, Lan, Jin-Peng, Luo, Bin, and Geng, Yifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text logo design heavily relies on the creativity and expertise of professional designers, in which arranging element layouts is one of the most important procedures. However, little attention has been paid to this specific task, which needs to take precise textural details and user constraints into consideration; most work addresses only broader tasks such as document/poster layout generation. In this paper, we propose a VLM-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user constraints, supporting a more flexible and stable layout design in real-world applications. We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously without performance degradation. To support instruction-tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset. In addition to the geometric annotations (e.g. text masks and character recognition), we also complement them with comprehensive layout descriptions in natural language format, for more effective training of reasoning ability when dealing with complex layouts and custom user constraints. Experimental studies demonstrate the effectiveness of our proposed model and datasets when compared with previous methods on various benchmarks that evaluate geometric aesthetics and human preferences. The code and datasets will be publicly available.
Published: 2024

7. Intelligent Adaptive Metasurface in Complex Wireless Environments

Author: Yang, Han Qing, Dai, Jun Yan, Li, Hui Dong, Wu, Lijie, Zhang, Meng Zhen, Shen, Zi Hang, Wang, Si Ran, Wang, Zheng Xing, Tang, Wankai, Jin, Shi, Wu, Jun Wei, Cheng, Qiang, and Cui, Tie Jun
Subjects: Physics - Applied Physics, Electrical Engineering and Systems Science - Systems and Control
Abstract: The programmable metasurface is regarded as one of the most promising transformative technologies for next-generation wireless system applications. Due to the lack of effective perception ability of the external electromagnetic environment, there are numerous challenges in the intelligent regulation of wireless channels, and it still relies on external sensors to reshape electromagnetic environment as desired. To address that problem, we propose an adaptive metasurface (AMS) which integrates the capabilities of acquiring wireless environment information and manipulating reflected electromagnetic (EM) waves in a programmable manner. The proposed design endows the metasurfaces with excellent capabilities to sense the complex electromagnetic field distributions around them and then dynamically manipulate the waves and signals in real time under the guidance of the sensed information, eliminating the need for prior knowledge or external inputs about the wireless environment. For verification, a prototype of the proposed AMS is constructed, and its dual capabilities of sensing and manipulation are experimentally validated. Additionally, different integrated sensing and communication (ISAC) scenarios with and without the aid of the AMS are established. The effectiveness of the AMS in enhancing communication quality is well demonstrated in complex electromagnetic environments, highlighting its beneficial application potential in future wireless systems.
Published: 2024

8. Multiple-partition cross-modulation programmable metasurface empowering wireless communications

Author: Zhang, Jun Wei, Qi, Zhen Jie, Wu, Li Jie, Cao, Wan Wan, Gao, Xinxin, Fu, Zhi Hui, Chen, Jing Yu, Lv, Jie Ming, Wang, Zheng Xing, Wang, Si Ran, Wu, Jun Wei, Zhang, Zhen, Zhang, Jia Nan, Li, Hui Dong, Dai, Jun Yan, Cheng, Qiang, and Cui, Tie Jun
Subjects: Physics - Applied Physics
Abstract: With their versatile manipulation capability, programmable metasurfaces are rapidly advancing in their intelligence, integration, and commercialization levels. However, as programmable metasurfaces scale up, their control configuration becomes increasingly complicated, posing significant challenges and limitations. Here, we propose a multiple-partition cross-modulation (MPCM) programmable metasurface to enhance the wireless communication coverage with low hardware complexity. We first propose an innovative encoding scheme to multiply the control voltage vectors of row-column crossing, achieving high beamforming precision in free space while maintaining low control hardware complexity and reducing memory requirements for coding sequences. We then design and fabricate an MPCM programmable metasurface to confirm the effectiveness of the proposed encoding scheme. The simulated and experimental results show good agreement with the theoretically calculated outcomes in beam scanning across the E and H planes and in free-space beam pointing. The MPCM programmable metasurface offers strong flexibility and low complexity by allowing various numbers and combinations of partition items in modulation methods, catering to diverse precision demands in various scenarios. We demonstrate the performance of the MPCM programmable metasurface in a realistic indoor setting, where the transmissions of videos to specific receiver positions are successfully achieved, surpassing the capabilities of traditional programmable metasurfaces. We believe that the proposed programmable metasurface has great potential to significantly empower wireless communications while addressing the challenges associated with programmable metasurface design and implementation.
Published: 2024

9. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Author: Li, Muyang, Lin, Yujun, Zhang, Zhekai, Cai, Tianle, Li, Xiuyu, Guo, Junxian, Xie, Enze, Meng, Chenlin, Zhu, Jun-Yan, and Han, Song
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-$\Sigma$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced., Comment: Quantization Library: https://github.com/mit-han-lab/deepcompressor Inference Engine: https://github.com/mit-han-lab/nunchaku Website: https://hanlab.mit.edu/projects/svdquant Demo: https://svdquant.mit.edu Blog: https://hanlab.mit.edu/blog/svdquant
Published: 2024
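
Illustrative note on record 9: the abstract's core idea is to let a high-precision low-rank branch absorb weight outliers so that the remaining residual is easier to quantize to 4 bits. The toy sketch below shows that effect on a random matrix with one outlier column. It is only a sketch under stated assumptions, not the SVDQuant or Nunchaku code: it uses a rank-1 branch found by power iteration (a stand-in for a truncated SVD), a simple symmetric per-tensor 4-bit quantizer, and hypothetical helper names throughout. It also omits the activation-to-weight outlier shifting and kernel fusion steps.

```python
import math
import random


def quantize_4bit(matrix):
    """Symmetric per-tensor fake quantization: round to integers in [-8, 7], then rescale."""
    max_abs = max(abs(x) for row in matrix for x in row)
    if max_abs == 0.0:
        return [row[:] for row in matrix]
    scale = max_abs / 7.0
    return [[max(-8, min(7, round(x / scale))) * scale for x in row] for row in matrix]


def top_singular_triplet(matrix, iters=50):
    """Leading singular triplet (u, sigma, v) via power iteration on M^T M."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v


def frobenius_error(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Toy weight matrix with one outlier column (hypothetical data).
rng = random.Random(0)
n = 32
weights = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
for row in weights:
    row[3] *= 50.0

# Baseline: quantize the full matrix directly.
direct_error = frobenius_error(weights, quantize_4bit(weights))

# Low-rank branch kept in full precision; only the residual is quantized.
u, sigma, v = top_singular_triplet(weights)
low_rank = [[sigma * u[i] * v[j] for j in range(n)] for i in range(n)]
residual = [[weights[i][j] - low_rank[i][j] for j in range(n)] for i in range(n)]
residual_q = quantize_4bit(residual)
approx = [[low_rank[i][j] + residual_q[i][j] for j in range(n)] for i in range(n)]
branch_error = frobenius_error(weights, approx)

print(f"direct 4-bit error:          {direct_error:.3f}")
print(f"low-rank + 4-bit residual:   {branch_error:.3f}")
```

In this toy setting the outlier column dominates the quantization scale when the matrix is quantized directly, so small entries collapse to zero. Removing the rank-1 component first shrinks the residual's dynamic range, which is why the second error comes out lower.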

10. Metabolomics reveals soluble epoxide hydrolase as a therapeutic target for high-sucrose diet-mediated gut barrier dysfunction

Author: Lin, Ai-Zhi, Fu, Xian, Jiang, Qing, Zhou, Xue, Hwang, Sung Hee, Yin, Hou-Hua, Ni, Kai-Di, Pan, Qing-Jin, He, Xin, Zhang, Ling-Tong, Meng, Yi-Wen, Liu, Ya-Nan, Hammock, Bruce D, and Liu, Jun-Yan
Subjects: Medical Biochemistry and Metabolomics, Biomedical and Clinical Sciences, Digestive Diseases, Colo-Rectal Cancer, Nutrition, Cancer, 2.1 Biological and endogenous factors, Epoxide Hydrolases, Animals, Mice, Metabolomics, Intestinal Mucosa, Mice, Knockout, Tight Junctions, Male, Mice, Inbred C57BL, Dietary Sucrose, Sucrose, Humans, Colon, Claudins, epoxyeicosatrienoic acid, high sucrose diet, metabolomics, soluble epoxide hydrolase
Abstract: High-sucrose diet (HSD) was reported as a causative factor for multiorgan injuries. The underlying mechanisms and therapeutic strategies remain largely uncharted. In the present study, by using a metabolomics approach, we identified the soluble epoxide hydrolase (sEH) as a therapeutic target for HSD-mediated gut barrier dysfunction. Specifically, 16-week feeding on an HSD caused gut barrier dysfunction, such as colon inflammation and tight junction impairment in a murine model. A metabolomics analysis of mouse colon tissue showed a decrease in the 5(6)-epoxyeicosatrienoic acid [5(6)-EET] level and an increase in soluble epoxide hydrolase, which is related to HSD-mediated injuries to the gut barrier. The mice treated with a chemical inhibitor of sEH and the mice with genetic intervention by intestinal-specific knockout of the sEH gene significantly attenuated HSD-caused intestinal injuries by reducing HSD-mediated colon inflammation and improving the impaired tight junction caused by an HSD. Further, in vitro studies showed that treatment with 5(6)-EET, but not its hydrolytic product 5,6-dihydroxyeicosatrienoic acid (5,6-DiHET), significantly ablated high sucrose-caused intestinal epithelial inflammation and impaired tight junction. Additionally, 5(6)-EET is anti-inflammatory and improves gut epithelial tight junction while 5,6-DiHET cannot do so. This study presents an underlying mechanism of and a therapeutic strategy for the gut barrier dysfunction caused by an HSD.
Published: 2024

11. POPoS: Improving Efficient and Robust Facial Landmark Detection with Parallel Optimal Position Search

Author: Xiang, Chong-Yang, He, Jun-Yan, Cheng, Zhi-Qi, Wu, Xiao, and Hua, Xian-Sheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the limitations of traditional FLD methods. POPoS employs three key contributions: (1) Pseudo-range multilateration is utilized to correct heatmap errors, improving landmark localization accuracy. By integrating multiple anchor points, it reduces the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To enhance the pseudo-range accuracy of selected anchor points, a new loss function, named multilateration anchor loss, is proposed. This loss function enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, boosting computational efficiency and reducing processing time. Extensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution heatmaps scenarios with minimal computational overhead. These advantages make POPoS a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios., Comment: Accepted to AAAI 2025, 9 pages, 6 figures. Code: https://github.com/teslatasy/POPoS
Published: 2024
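
Illustrative note on record 11: the abstract relies on multilateration, that is, recovering a position from distances to several anchor points so that an error at any single anchor has less influence. The sketch below shows the textbook linear least-squares form of 2D multilateration. It is not the POPoS algorithm: it assumes plain (bias-free) ranges rather than pseudo-ranges, and the function name `multilaterate` and the sample anchors are hypothetical.

```python
import math


def multilaterate(anchors, ranges):
    """Estimate (x, y) from distances to known 2D anchors by linear least squares.

    Subtracting the first range equation from each of the others removes the
    quadratic terms, leaving a linear system that is solved via its 2x2 normal equations.
    """
    if len(anchors) != len(ranges) or len(anchors) < 3:
        raise ValueError("need at least three anchors with one range each")
    (x0, y0), r0 = anchors[0], ranges[0]
    sxx = sxy = syy = sxb = syb = 0.0
    for (xi, yi), ri in zip(anchors[1:], ranges[1:]):
        ax = 2.0 * (xi - x0)
        ay = 2.0 * (yi - y0)
        b = r0 ** 2 - ri ** 2 + xi ** 2 - x0 ** 2 + yi ** 2 - y0 ** 2
        sxx += ax * ax
        sxy += ax * ay
        syy += ay * ay
        sxb += ax * b
        syb += ay * b
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    x = (syy * sxb - sxy * syb) / det
    y = (sxx * syb - sxy * sxb) / det
    return x, y


# Usage with hypothetical anchors and exact distances to the point (1, 2).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
target = (1.0, 2.0)
ranges = [math.dist(a, target) for a in anchors]
estimate = multilaterate(anchors, ranges)
print(estimate)
```

With more than three anchors the system is overdetermined, so a noisy distance at one anchor is averaged against the others instead of determining the result on its own.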

12. Simplified radar architecture based on information metasurface

Author: Wang, Si Ran, Chen, Zhan Ye, Chen, Shao Nan, Dai, Jun Yan, Zhang, Jun Wei, Qi, Zhen Jie, Wu, Li Jie, Sun, Meng Ke, Zhou, Qun Yan, Li, Hui Dong, Luo, Zhang Jie, Cheng, Qiang, and Cui, Tie Jun
Subjects: Physics - Applied Physics
Abstract: Modern radar typically employs a chain architecture that consists of radio-frequency (RF) and intermediate frequency (IF) units, baseband digital signal processor, and information display. However, this architecture often results in high costs, significant hardware demands, and integration challenges. Here we propose a simplified radar architecture based on space-time-coding (STC) information metasurfaces. With their powerful capabilities to generate multiple harmonic frequencies and customize their phases, the STC metasurfaces play a key role in chirp signal generation, transmission, and echo reception. Remarkably, the receiving STC metasurface can implement dechirp processing directly on the RF level and realize the digital information outputs, which are beneficial to lower the hardware requirement at the receiving end while potentially shortening the time needed for conventional digital processing. As a proof of concept, the proposed metasurface radar is tested in a series of experiments for target detection and range/speed measurement, yielding results comparable to those obtained by conventional methods. This study provides valuable inspiration for a new radar system paradigm to combine the RF front ends and signal processors on the information metasurface platform that offers essential functionalities while significantly reducing the system complexity and cost., Comment: 25 pages, 10 figures
Published: 2024

13. Generative Photomontage

Author: Liu, Sean J., Kumari, Nupur, Shamir, Ariel, and Zhu, Jun-Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines., Comment: Project webpage: https://lseancs.github.io/generativephotomontage/ ; corrected typos in v2
Published: 2024

14. MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Author: He, Jun-Yan, Cheng, Zhi-Qi, Li, Chenyang, Sun, Jingdong, He, Qi, Xiang, Wangmeng, Chen, Hanyuan, Lan, Jin-Peng, Lin, Xianhui, Zhu, Kang, Luo, Bin, Geng, Yifeng, Xie, Xuansong, and Hauptmann, Alexander G.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Multimedia
Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results., Comment: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt
Published: 2024

15. Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Author: Li, Heng, Li, Minghan, Cheng, Zhi-Qi, Dong, Yifei, Zhou, Yuxuan, He, Jun-Yan, Dai, Qi, Mitamura, Teruko, and Hauptmann, Alexander G.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments., Comment: Spotlight at NeurIPS 2024 D&B Track. 32 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/
Published: 2024

16. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Author: Cheng, Zebang, Cheng, Zhi-Qi, He, Jun-Yan, Sun, Jingdong, Wang, Kai, Lin, Yuxiang, Lian, Zheng, Peng, Xiaojiang, and Hauptmann, Alexander
Subjects: Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset., Comment: Accepted at NeurIPS 2024. 49 pages, 13 figures, Project: https://github.com/ZebangCheng/Emotion-LLaMA, Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA
Published: 2024
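
Illustrative note on record 16: the abstract reports UAR and WAR scores on DFEW. Under their usual definitions, UAR (unweighted average recall) is the plain mean of per-class recall, and WAR (weighted average recall) weights each class's recall by its number of samples, which makes it equal to overall accuracy. The sketch below computes both from label lists. It is a minimal illustration with a hypothetical function name and toy labels, not the benchmark's evaluation script, and it returns fractions where the abstract quotes percentages.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1] for paired label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    support = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: correct[label] / count for label, count in support.items()}
    uar = sum(recalls.values()) / len(recalls)
    war = sum(recalls[label] * count for label, count in support.items()) / len(y_true)
    return uar, war


# Toy example with an imbalanced label set (hypothetical data).
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 8 + ["happy", "sad"]
uar, war = uar_war(y_true, y_pred)
print(f"UAR={uar:.2f} WAR={war:.2f}")
```

On imbalanced data the two can differ noticeably: here the minority class is recalled only half the time, which pulls UAR well below WAR.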

17. Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

Author: Wang, Sheng-Yu, Hertzmann, Aaron, Efros, Alexei A., Zhu, Jun-Yan, and Zhang, Richard
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The goal of data attribution for text-to-image models is to identify the training images that most influence the generation of a new image. Influence is defined such that, for a given output, if a model is retrained from scratch without the most influential images, the model would fail to reproduce the same output. Unfortunately, directly searching for these influential images is computationally infeasible, since it would require repeatedly retraining models from scratch. In our work, we propose an efficient data attribution method by simulating unlearning the synthesized image. We achieve this by increasing the training loss on the output image, without catastrophic forgetting of other, unrelated concepts. We then identify training images with significant loss deviations after the unlearning process and label these as influential. We evaluate our method with a computationally intensive but "gold-standard" retraining from scratch and demonstrate our method's advantages over previous methods., Comment: Updated v2 -- NeurIPS 2024 camera ready version. Project page: https://peterwang512.github.io/AttributeByUnlearning Code: https://github.com/PeterWang512/AttributeByUnlearning
Published: 2024

18. Distilling Diffusion Models into Conditional GANs

Author: Kang, Minguk, Zhang, Richard, Barnes, Connelly, Paris, Sylvain, Kwak, Suha, Park, Jaesik, Shechtman, Eli, Zhu, Jun-Yan, and Park, Taesung
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark., Comment: Project page: https://mingukkang.github.io/Diffusion2GAN/ (ECCV2024)
Published: 2024

19. Customizing Text-to-Image Models with a Single Image Pair

Author: Jones, Maxwell, Wang, Sheng-Yu, Kumari, Nupur, Bau, David, and Zhu, Jun-Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair., Comment: project page: https://paircustomization.github.io/
Published: 2024

20. MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Author: Li, Xiang, Cheng, Zhi-Qi, He, Jun-Yan, Peng, Xiaojiang, and Hauptmann, Alexander G.
Subjects: Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
Published: 2024

21. On the Content Bias in Fréchet Video Distance

Author: Ge, Songwei, Mahapatra, Aniruddha, Parmar, Gaurav, Zhu, Jun-Yan, and Huang, Jia-Bin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Fréchet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on the content-biased dataset. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis., Comment: CVPR 2024. Project webpage: https://content-debiased-fvd.github.io/
Published: 2024

22. Customizing Text-to-Image Diffusion with Object Viewpoint Control

Author: Kumari, Nupur, Su, Grace, Zhang, Richard, Park, Taesung, Shechtman, Eli, and Zhu, Jun-Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt., Comment: project page: https://customdiffusion360.github.io
Published: 2024

23. Exploring Dynamic Transformer for Efficient Object Tracking

Author: Zhu, Jiawen, Chen, Xin, Diao, Haiwen, Li, Shuai, He, Jun-Yan, Li, Chenyang, Luo, Bin, Wang, Dong, and Lu, Huchuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The speed-precision trade-off is a critical problem for visual object tracking which usually requires low latency and deployment on constrained resources. Existing solutions for efficient tracking mainly focus on adopting light-weight backbones or modules, which nevertheless come at the cost of a sacrifice in precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation could be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Thus, it can achieve higher performance with the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the entire model. Especially, to fully utilize the computations, we introduce the feature recycling mechanism to reuse the outputs of predecessors. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT with a speed of 256 fps.
Published: 2024

24. Reconnaissance ultracool spectra in the Euclid Deep Fields

Author: Zhang, Jerry Jun-Yan, Lodieu, Nicolas, and Martín, Eduardo
Subjects: Astrophysics - Solar and Stellar Astrophysics, Astrophysics - Earth and Planetary Astrophysics
Abstract: Context. Euclid will carry out a deep survey benefiting the discovery and characterisation of ultracool dwarfs (UCDs), especially in the Euclid Deep Fields (EDFs), which the telescope will scan repeatedly throughout its mission. The photometric and spectroscopic standards in the EDFs are important benchmarks, crucial for the classification and characterisation of new UCD discoveries and for the calibration of the mission itself. Aims. We aim to provide a list of photometric UCD candidates and collect near-infrared reconnaissance spectra for M, L, and T-type UCDs in the EDFs as future Euclid UCD references. Methods. In EDF North, we cross-matched public optical and infrared surveys with certain photometric criteria to select UCDs. In EDF Fornax and EDF South, we used photometrically classified samples from the literature. We also include UCDs identified by Gaia DR2. We selected 7 UCD targets with different spectral types from the lists and obtained low-resolution 0.9-2.5 μm spectra of them using GTC/EMIR and the VLT/X-shooter. We also selected a young, bright L dwarf near EDF Fornax to test the coherence of these two facilities. We included an extra T dwarf in EDF North with its published J-band spectrum. Results. We retrieved a list of 81 (49, 231) M, 8 (29, 115) L, and 1 (0, 2) T dwarf candidates in EDF North, Fornax, and South, respectively. They are provided to guide future UCD discoveries and characterisations by Euclid. In total, we collected near-infrared spectra for 9 UCDs, including 2 M types, 3 L types, and 4 T types in or close to the 3 EDFs. The Euclidised spectra show consistency in their spectral classification, which demonstrates that slitless Euclid spectroscopy will recover the spectral types with high fidelity for UCDs, both in the EDFs and in the wide survey. We also demonstrate that Euclid will be able to distinguish different age groups of UCDs., Comment: 9 pages, 4 figures, 3 appendices, accepted for publication in A&A on Mar 12 2024. Late-M-type, and L-type UCD lists in EDF North were corrected, reference added
Published: 2024

25. One-Step Image Translation with Text-to-Image Models

Author: Parmar, Gaurav, Park, Taesung, Narasimhan, Srinivasa, and Zhu, Jun-Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo., Comment: Github: https://github.com/GaParmar/img2img-turbo
Published: 2024

26. DyRoNet: Dynamic Routing and Low-Rank Adapters for Autonomous Driving Streaming Perception

Author: Huang, Xiang, Cheng, Zhi-Qi, He, Jun-Yan, Li, Chenyang, Xiang, Wangmeng, and Sun, Baigui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The advancement of autonomous driving systems hinges on the ability to achieve low-latency and high-accuracy perception. To address this critical need, this paper introduces Dynamic Routing Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each meticulously fine-tuned to function under distinct environmental conditions. At its core, the framework offers a speed router module, developed to assess and route input data to the most suitable branch for processing. This approach not only addresses the inherent limitations of conventional models in adapting to diverse driving conditions but also ensures the balance between performance and efficiency. Extensive experimental evaluations demonstrate the adaptability of DyRoNet to diverse branch selection strategies, resulting in significant performance enhancements across different scenarios. This work establishes a new benchmark for streaming perception and provides valuable engineering insights for future work., Comment: Accepted to WACV 2025. 17 pages, 8 figures. Project: https://tastevision.github.io/DyRoNet/
Published: 2024

27. Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

Author: He, Junwen, Wang, Yifan, Wang, Lijun, Lu, Huchuan, He, Jun-Yan, Lan, Jin-Peng, Luo, Bin, and Xie, Xuansong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal Large Language Models (MLLMs) leverage Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However, there still remains a gap in providing fine-grained pixel-level perceptions and extending interactions beyond text-specific inputs. In this work, we propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references, such as texts, boxes, images, or audio. This innovation empowers users with greater flexibility to engage with the model beyond textual and regional prompts, without modality-specific designs. Through our proposed refocusing mechanism, the generated grounding output is guided to better focus on the referenced object, implicitly incorporating additional pixel-level supervision. This simple modification utilizes attention scores generated during the inference of LLM, eliminating the need for extra computations while exhibiting performance enhancements in both grounding masks and referring expressions. With only publicly available training data, our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation., Comment: CVPR 2024
Published: 2024

28. Consolidating Attention Features for Multi-view Image Editing

Author: Patashnik, Or, Gal, Rinon, Cohen-Or, Daniel, Zhu, Jun-Yan, and De la Torre, Fernando
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry., Comment: Project Page at https://qnerf-consolidation.github.io/qnerf-consolidation/
Published: 2024

29. CoFRIDA: Self-Supervised Fine-Tuning for Human-Robot Co-Painting

Author: Schaldenbrand, Peter, Parmar, Gaurav, Zhu, Jun-Yan, McCann, James, and Oh, Jean
Subjects: Computer Science - Robotics
Abstract: Prior robot painting and drawing work, such as FRIDA, has focused on decreasing the sim-to-real gap and expanding input modalities for users, but the interaction with these systems generally exists only in the input stages. To support interactive, human-robot collaborative painting, we introduce the Collaborative FRIDA (CoFRIDA) robot painting framework, which can co-paint by modifying and engaging with content already painted by a human collaborator. To improve text-image alignment, FRIDA's major weakness, our system uses pre-trained text-to-image models; however, pre-trained models in the context of real-world co-painting do not perform well because they (1) do not understand the constraints and abilities of the robot and (2) cannot perform co-painting without making unrealistic edits to the canvas and overwriting content. We propose a self-supervised fine-tuning procedure that can tackle both issues, allowing the use of pre-trained state-of-the-art text-image alignment models with robots to enable co-painting in the physical world. Our open-source approach, CoFRIDA, creates paintings and drawings that match the input text prompt more clearly than FRIDA, both from a blank canvas and one with human created work. More generally, our fine-tuning procedure successfully encodes the robot's constraints and abilities into a foundation model, showcasing promising results as an effective method for reducing sim-to-real gaps.
Published: 2024

30. FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Author: Deng, Kangle, Omernick, Timothy, Weiss, Alexander, Ramanan, Deva, Zhu, Jun-Yan, Zhou, Tinghui, and Agrawala, Maneesh
Subjects: Computer Science - Graphics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our algorithm is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures., Comment: Project page: https://flashtex.github.io/
Published: 2024

31. Design of a W-band High-PAE Class A&AB Power Amplifier in 150nm GaAs Technology

Author: Lee, Jun Yan, Wu, Duo, Guo, Xuanrui, Ariannejad, Mohammad Mahdi, Bhuiyan, Mohammad Arif Sobhan, and Miraz, Mahdi H.
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Emerging Technologies, Electrical Engineering and Systems Science - Signal Processing
Abstract: Nanometer-scale power amplifiers (PAs) at sub-THz frequencies suffer from severe parasitic effects that limit the maximum frequency and reduce power performance at the device transceiver front end. Integrated circuit researchers have proposed different PA design architecture combinations at scaled-down technologies to overcome these limitations. Although the designs meet the minimum requirements, the power-added efficiency (PAE) of the PA is still quite low. In this paper, a W-band single-ended common-source (CS) and cascode integrated 3-stage 2-way PA design is proposed. The design integrates several key design methodologies to mitigate the parasitics, such as combined Class AB and Class A stages for gain boosting and efficiency enhancement, a Wilkinson power combiner for higher output power, linearity, and bandwidth, and a transmission line (TL)-based wideband matching network for better inter-stage matching and compact size. The proposed PA design is validated in UMS 150-nm GaAs pHEMT technology using the Advanced Design System (ADS) simulator. The results show that the proposed PA achieves a gain of 20.1 dB, an output power of 17.2 dBm, a PAE of 33%, and a 21 GHz bandwidth at the 90 GHz sub-THz band. The PA layout occupies only 5.66 × 2.51 mm² of die space, including pads. The proposed PA design is expected to support research on sub-THz integrated circuits and ease the widespread adoption of 6G in the near future.
Published: 2024
Full Text: View/download PDF

32. An Exploratory Pilot Study on the Application of Radiofrequency Ablation for Atrial Fibrillation Guided by Computed Tomography-Based 3D Printing Technology

Author: Yue, Jun-Yan, Li, Pei-Cheng, Li, Mei-Xia, Wu, Qing-Wu, Liang, Chang-Hua, Chen, Jie, Zhu, Zhi-Ping, Li, Pei-Heng, Dou, Wen-Guang, and Gao, Jian-Bo
Published: 2024
Full Text: View/download PDF

33. WordArt Designer API: User-Driven Artistic Typography Synthesis with Large Language Models on ModelScope

Author: He, Jun-Yan, Cheng, Zhi-Qi, Li, Chenyang, Sun, Jingdong, Xiang, Wangmeng, Hu, Yusen, Lin, Xianhui, Kang, Xiaoyang, Jin, Zengke, Luo, Bin, Geng, Yifeng, Xie, Xuansong, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: This paper introduces the WordArt Designer API, a novel framework for user-driven artistic typography synthesis utilizing Large Language Models (LLMs) on ModelScope. We address the challenge of simplifying artistic typography for non-professionals by offering a dynamic, adaptive, and computationally efficient alternative to traditional rigid templates. Our approach leverages the power of LLMs to understand and interpret user input, facilitating a more intuitive design process. We demonstrate through various case studies how users can articulate their aesthetic preferences and functional requirements, which the system then translates into unique and creative typographic designs. Our evaluations indicate significant improvements in user satisfaction, design flexibility, and creative expression over existing systems. The WordArt Designer API not only democratizes the art of typography but also opens up new possibilities for personalized digital communication and design., Comment: Spotlight Paper at the Workshop on Machine Learning for Creativity and Design, 37th Conference on Neural Information Processing Systems (NeurIPS 2023). 5 pages, 5 figures
Published: 2024

34. Tracking with Human-Intent Reasoning

Author: Zhu, Jiawen, Cheng, Zhi-Qi, He, Jun-Yan, Li, Chenyang, Luo, Bin, Lu, Huchuan, Geng, Yifeng, and Xie, Xuansong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Advances in perception modeling have significantly improved the performance of object tracking. However, the current methods for specifying the target object in the initial frame are either by 1) using a box or mask template, or by 2) providing an explicit language description. These manners are cumbersome and do not allow the tracker to have self-reasoning ability. Therefore, this work proposes a new tracking task -- Instruction Tracking, which involves providing implicit tracking instructions that require the trackers to perform tracking automatically in video frames. To achieve this, we investigate the integration of knowledge and reasoning capabilities from a Large Vision-Language Model (LVLM) for object tracking. Specifically, we propose a tracker called TrackGPT, which is capable of performing complex reasoning-based tracking. TrackGPT first uses LVLM to understand tracking instructions and condense the cues of what target to track into referring embeddings. The perception component then generates the tracking results based on the embeddings. To evaluate the performance of TrackGPT, we construct an instruction tracking benchmark called InsTrack, which contains over one thousand instruction-video pairs for instruction tuning and evaluation. Experiments show that TrackGPT achieves competitive performance on referring video object segmentation benchmarks, such as getting a new state-of-the-art performance of 66.5 J&F on Refer-DAVIS. It also demonstrates a superior performance of instruction tracking under new evaluation protocols. The code and models are available at https://github.com/jiawen-zhu/TrackGPT., Comment: 8 pages, 4 figures
Published: 2023

35. Holistic Evaluation of Text-To-Image Models

Author: Lee, Tony, Yasunaga, Michihiro, Meng, Chenlin, Mai, Yifan, Park, Joon Sung, Gupta, Agrim, Zhang, Yunzhi, Narayanan, Deepak, Teufel, Hannah Benita, Bellagente, Marco, Kang, Minguk, Park, Taesung, Leskovec, Jure, Zhu, Jun-Yan, Fei-Fei, Li, Wu, Jiajun, Ermon, Stefano, and Liang, Percy
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase., Comment: NeurIPS 2023. First three authors contributed equally
Published: 2023

36. AnyText: Multilingual Visual Text Generation And Editing

Author: Tuo, Yuxiang, Xiang, Wangmeng, He, Jun-Yan, Geng, Yifeng, and Xie, Xuansong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion-model-based text-to-image generation has made impressive progress recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, the text areas of a generated image can still give it away on close inspection. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. In extensive evaluation experiments, our method outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
Published: 2023

37. Epoxy metabolites of linoleic acid promote the development of breast cancer via orchestrating PLEC/NFκB1/CXCL9-mediated tumor growth and metastasis

Author: Ni, Kai-Di, Fu, Xian, Luo, Ying, He, Xin, Yin, Hou-Hua, Mo, Dong-Ping, Wu, Jing-Xian, Wu, Ming-Jun, Zheng, Xiao, Liu, Ya-Nan, Jiang, Qing, Zhang, Ling-Tong, Lin, Ai-Zhi, Huang, Ling, Pan, Qing-Jin, Yin, Xue-Dong, Zhang, Huan-Yu, Meng, Yi-Wen, Zhou, Xue, Pan, Jianbo, Guo, Zufeng, and Liu, Jun-Yan
Published: 2024
Full Text: View/download PDF

38. Characteristics of the early innate response induced by the aerosolized Ad5-vectored COVID-19 vaccine

Author: Zheng, Wan-Ru, Dan, Jun-Yan, Huo, Nan, Zhang, Zhe, and Hou, Li-Hua
Published: 2024
Full Text: View/download PDF

39. Comparisons between Caucasian-validated and Chinese-validated photo-numeric scales for assessing facial wrinkles

Author: Ng, Jun Yan, Zhou, Hongyu, Li, Tianqi, and Chew, Fook Tim
Published: 2024
Full Text: View/download PDF

40. A synthetic moving-envelope metasurface antenna for independent control of arbitrary harmonic orders

Author: Wu, Geng-Bo, Dai, Jun Yan, Shum, Kam Man, Chan, Ka Fai, Cheng, Qiang, Cui, Tie Jun, and Chan, Chi Hou
Published: 2024
Full Text: View/download PDF

41. PJA1-mediated suppression of pyroptosis as a driver of docetaxel resistance in nasopharyngeal carcinoma

Author: Huang, Sheng-Yan, Gong, Sha, Zhao, Yin, Ye, Ming-Liang, Li, Jun-Yan, He, Qing-Mei, Qiao, Han, Tan, Xi-Rong, Wang, Jing-Yun, Liang, Ye-Lin, Huang, Sai-Wei, He, Shi-Wei, Li, Ying-Qin, Xu, Sha, Li, Ying-Qing, and Liu, Na
Published: 2024
Full Text: View/download PDF

42. Transvaginal versus transabdominal specimen extraction in minimally invasive surgery: a systematic review and meta-analysis

Author: Chang, Jasmine Hui Er, Xu, Hongyun, Zhao, Yun, Wee, Ian Jun Yan, Ang, Joella Xiaohong, Tan, Emile Kwong-Wei, and Seow-En, Isaac
Published: 2024
Full Text: View/download PDF

43. Comparisons between wrinkles and photo-ageing detected and self-reported by the participant or identified by trained assessors reveal insights from Chinese individuals in the Singapore/Malaysia Cross-sectional Genetics Epidemiology Study (SMCGES) cohort

Author: Ng, Jun Yan, Zhou, Hongyu, Li, Tianqi, and Chew, Fook Tim
Published: 2024
Full Text: View/download PDF

44. Correction: Evaluating the consistency in different methods for measuring left atrium diameters

Author: Yue, Jun-Yan, Ji, Kai, Liu, Hai-Peng, Wu, Qing-Wu, Liang, Chang-Hua, and Gao, Jian-Bo
Published: 2024
Full Text: View/download PDF

45. Evaluating the consistency in different methods for measuring left atrium diameters

Author: Yue, Jun-Yan, Ji, Kai, Liu, Hai-Peng, Wu, Qing-Wu, Liang, Chang-Hua, and Gao, Jian-Bo
Published: 2024
Full Text: View/download PDF

46. Analysis of related factors for RA flares after SARS-CoV-2 infection: a retrospective study from patient survey

Author: Li, Rong, Zhao, Jun-Kang, Li, Qian, Zhao, Li, Su, Ya-Zhen, Zhang, Jun-yan, and Zhang, Li-Yun
Published: 2024
Full Text: View/download PDF

47. Quality of life, household income, and dietary habits are associated with the risk of sarcopenia among the Chinese elderly

Author: Wan, Hua, Hu, Yan-Hui, Li, Wei-Peng, Wang, Quan, Su, Hong, Chenshu, Jun-Yan, Lu, Xiang, and Gao, Wei
Published: 2024
Full Text: View/download PDF

48. Sleep and allergic diseases among young Chinese adults from the Singapore/Malaysia Cross-Sectional Genetic Epidemiology Study (SMCGES) cohort

Author: Wong, Qi Yi Ambrose, Lim, Jun Jie, Ng, Jun Yan, Lim, Yi Ying Eliza, Sio, Yang Yie, and Chew, Fook Tim
Published: 2024
Full Text: View/download PDF

49. Reorganization of intrinsic functional connectivity in early-stage Parkinson’s disease patients with probable REM sleep behavior disorder

Author: Dan, Xiao-Juan, Wang, Yu-Wei, Sun, Jun-Yan, Gao, Lin-Lin, Chen, Xiao, Yang, Xue-Ying, Xu, Er-He, Ma, Jing-Hong, Yan, Chao-Gan, Wu, Tao, and Chan, Piu
Published: 2024
Full Text: View/download PDF

50. The Quality of Public-Funded Oriented Physical Education Normal Students' Cultivate in Guangdong Province

Author: Jun, Yan and Sungkawadee, Panya
Abstract: Background and Aim: In February 2018, Guangdong Province launched a province-wide "new teacher training" implementation program to correct the structural imbalance of music, physical education, and art teachers in rural areas. This research aimed to avoid risks in all phases of the cultivation of public-funded oriented physical education normal students in Guangdong Province. Materials and Methods: This study examined the motivation of normal students in Guangdong Province for applying to public-funded oriented physical education, the indicator system for the development of rural feeling, and the factors influencing the willingness to comply. Analysis software was used to analyze the data collected from 462 public-funded oriented physical education normal students. The researchers used self-administered questionnaires, descriptive statistics, and exploratory factor analysis. The Delphi method and hierarchical analysis were also used, with the first two rounds used to determine the indicator system and the third round used to calculate the weight of each indicator. Results: (1) The discriminant validity analysis showed that the values of the influencing factors were 0.921 (Teacher's Love), 0.8800 (Self-awareness), 0.789 (Career Benefits), and 0.819 (Policy dividends). (2) Combining the results of the first and second rounds of expert questionnaires, the final evaluation index system for public-funded oriented physical education normal students' sentiments towards local teaching was determined to consist of 4 primary indicators, 10 secondary indicators, and 30 tertiary indicators. Conclusion: Four factors influence the motivation of normal students for applying to public-funded oriented physical education: Policy dividends, Career Benefits, Self-awareness, and Teacher's Love. Rural feeling can serve as a mediating variable influencing the effect of learning engagement on willingness to comply, and School Climate can serve as a moderating variable regulating the relationship between learning engagement and Rural feeling, as well as between learning engagement and willingness to comply. In particular, when the School Climate tends to be 'local', the effect of learning engagement on Rural feeling and willingness to comply is greater.
Published: 2023
