Author: "Su, Yu" / Search Limiters: Available in Library Collection - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Su, Yu"' showing total 7,930 results

1. Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Author: Gu, Yu, Zheng, Boyuan, Gou, Boyu, Zhang, Kai, Chang, Cheng, Srivastava, Sanjari, Xie, Yanan, Qi, Peng, Sun, Huan, and Su, Yu
Subjects: Computer Science - Artificial Intelligence
Abstract: Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still substantially underperform humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction, VisualWebArena and Mind2Web-live, demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
Comment: 18 pages, 6 figures, 4 tables
Published: 2024

2. Integrated Location Sensing and Communication for Ultra-Massive MIMO With Hybrid-Field Beam-Squint Effect

Author: Gao, Zhen, Zhou, Xingyu, Ning, Boyu, Su, Yu, Qin, Tong, and Niyato, Dusit
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Information Theory
Abstract: The advent of ultra-massive multiple-input multiple-output systems holds great promise for next-generation communications, yet their channels exhibit the hybrid far- and near-field beam-squint (HFBS) effect. In this paper, we not only overcome but also harness the HFBS effect to propose an integrated location sensing and communication (ILSC) framework. During the uplink training stage, user terminals (UTs) transmit reference signals for simultaneous channel estimation and location sensing. This stage leverages an elaborately designed hybrid-field projection matrix to overcome the HFBS effect and estimate the channel in a compressive manner. Subsequently, the scatterers' locations can be sensed from the spherical wavefront based on the channel estimation results. By treating the sensed scatterers as virtual anchors, we employ a weighted least-squares approach to derive the UT's location. Moreover, we propose an iterative refinement mechanism, which utilizes the accurately estimated time difference of arrival of multipath components to enhance location sensing precision. In the following downlink data transmission stage, we leverage the acquired location information to further optimize the hybrid beamformer, which combines beam broadening and focusing to mitigate the spectral efficiency degradation resulting from the HFBS effect. Extensive simulation experiments demonstrate that the proposed ILSC scheme achieves better location sensing and communication performance than conventional methods.
Comment: This paper has been accepted by IEEE JSAC
Published: 2024

3. YourSkatingCoach: A Figure Skating Video Benchmark for Fine-Grained Element Analysis

Author: Chen, Wei-Yi, Lin, Yi-Ling, Su, Yu-An, Yeh, Wei-Hsin, and Ku, Lun-Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Combining sports and machine learning involves leveraging ML algorithms and techniques to extract insight from sports-related data such as player statistics, game footage, and other relevant information. However, datasets related to figure skating in the literature focus primarily on element classification and are currently unavailable or exhibit only limited access, which greatly raises the entry barrier to developing visual sports technology for the sport. Moreover, when using such data to help athletes improve their skills, we find they are very coarse-grained: they work for learning what an element is, but they are poorly suited to learning whether the element is good or bad. Here we propose air time detection, a novel motion analysis task, the goal of which is to accurately detect the duration of the air time of a jump. We present YourSkatingCoach, a large, novel figure skating dataset which contains 454 videos of jump elements, the detected skater skeletons in each video, along with the gold labels of the start and ending frames of each jump, together as a video benchmark for figure skating. In addition, although this type of task is often viewed as classification, we cast it as a sequential labeling problem and propose a Transformer-based model to calculate the duration. Experimental results show that the proposed model yields favorable results against a strong baseline. To further verify the generalizability of the fine-grained labels, we apply the same process to other sports as cross-sports tasks, but for the coarse-grained task of action classification. Here we fine-tune the classification to demonstrate that figure skating, as it contains the essential body movements, constitutes a strong foundation for adaptation to other sports.
Published: 2024

4. KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

Author: Huang, Hsin-Ping, Wang, Xinyi, Bitton, Yonatan, Taitelbaum, Hagai, Tomar, Gaurav Singh, Chang, Ming-Wei, Jia, Xuhui, Chan, Kelvin C. K., Hu, Hexiang, Su, Yu-Chuan, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities, a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance entity fidelity by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.
Comment: Project page: https://kitten-project.github.io/
Published: 2024

5. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Author: Gou, Boyu, Wang, Ruohan, Zheng, Boyuan, Xie, Yanan, Chang, Cheng, Shu, Yiheng, Sun, Huan, and Su, Yu
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promise of GUI agents that navigate the digital world as humans do.
Published: 2024

6. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Author: Chen, Ziru, Chen, Shijie, Ning, Yuting, Zhang, Qianheng, Wang, Boshi, Yu, Botao, Li, Yifei, Liao, Zeyi, Wei, Chen, Lu, Zitong, Dey, Vishal, Xue, Mingyi, Baker, Frazier N., Burns, Benjamin, Adu-Ampratwum, Daniel, Huang, Xuhui, Ning, Xia, Gao, Song, Su, Yu, and Sun, Huan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1 with direct prompting and self-debug, which demonstrates the effectiveness of increasing inference-time compute. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
Comment: 57 pages
Published: 2024

7. Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

Author: Chen, Shijie, Gutiérrez, Bernal Jiménez, and Su, Yu
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration of novel ways of utilizing open-weight LLMs beyond text generation.
Published: 2024

8. Fine-Tuning is Fine, if Calibrated

Author: Mai, Zheda, Chowdhury, Arpita, Zhang, Ping, Tu, Cheng-Hao, Chen, Hong-You, Pahuja, Vardaan, Berger-Wolf, Tanya, Gao, Song, Stewart, Charles, Su, Yu, and Chao, Wei-Lun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model's accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, "What has been damaged in the fine-tuned model?" To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model's capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis. Our code is available at https://github.com/OSU-MLB/Fine-Tuning-Is-Fine-If-Calibrated.
Comment: The paper has been accepted to NeurIPS 2024. The first three authors contribute equally
Published: 2024

9. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Author: Yue, Xiang, Zheng, Tianyu, Ni, Yuansheng, Wang, Yubo, Zhang, Kai, Tong, Shengbang, Sun, Yuxuan, Yu, Botao, Zhang, Ge, Sun, Huan, Su, Yu, Chen, Wenhu, and Neubig, Graham
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
Published: 2024

10. Extended dissipaton-equation-of-motion approach to study the electronic migration in adatom-graphene composite

Author: Su, Yu, Wang, Yao, Zhu, Zi-Fan, Kong, Yuan, Xu, Rui-Xue, Yan, YiJing, and Zheng, Xiao
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics, Physics - Chemical Physics
Abstract: Graphene has garnered significant attention due to its unique properties. Among its many intriguing characteristics, the tuning effects induced by adsorbed atoms (adatoms) provide immense potential for the design of graphene-based electronic devices. This work explores the electronic migration in the adatom-graphene composite, using the extended dissipaton-equation-of-motion (DEOM) approach. As an exact dynamics theory for open quantum systems embedded in environments composed of non-interacting electrons, the extended DEOM is capable of handling both linear and quadratic environmental couplings (a certain non-Gaussian effect) which account for the interactions between the adatom and the graphene substrate. We demonstrate and analyze the adatom-graphene correlated properties and the tuning effects by simulating the adatom spectral functions with varied Coulomb repulsion strengths. This work offers not only advanced theoretical methods but also new insights into the theoretical investigation of complex functional materials such as graphene-based electronic devices.
Comment: 8 pages, 5 figures
Published: 2024

11. VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

Author: Maruf, M., Daw, Arka, Mehrab, Kazi Sajeed, Manogaran, Harish Babu, Neog, Abhilash, Sawhney, Medha, Khurana, Mridul, Balhoff, James P., Bakis, Yasin, Altintas, Bahadir, Thompson, Matthew J., Campolongo, Elizabeth G., Uyeda, Josef C., Lapp, Hilmar, Bart, Henry L., Mabee, Paula M., Su, Yu, Chao, Wei-Lun, Stewart, Charles, Berger-Wolf, Tanya, Dahdul, Wasila, and Karpatne, Anuj
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at https://github.com/sammarfy/VLM4Bio.
Comment: 36 pages, 37 figures, 7 tables
Published: 2024

12. RePair: Automated Program Repair with Process-based Feedback

Author: Zhao, Yuze, Huang, Zhenya, Ma, Yixiao, Li, Rui, Zhang, Kai, Jiang, Hao, Liu, Qi, Zhu, Linbo, and Su, Yu
Subjects: Computer Science - Software Engineering, Computer Science - Computation and Language
Abstract: The gap between the trepidation of program reliability and the expense of repairs underscores the indispensability of Automated Program Repair (APR). APR is instrumental in transforming vulnerable programs into more robust ones, bolstering program reliability while simultaneously diminishing the financial burden of manual repairs. Commercial-scale language models (LMs) have taken APR to unprecedented levels. However, for models with fewer than 100B parameters, single-step modifications may struggle to achieve the desired effect. Moreover, humans interact with the LM through explicit prompts, which hinders the LM from receiving feedback from the compiler and test cases to automatically optimize its repair policies. In this paper, we explore how small-scale LMs (fewer than 20B parameters) can achieve excellent performance through process supervision and feedback. We start by constructing a dataset named CodeNet4Repair, replete with multiple repair records, which supervises the fine-tuning of a foundational model. Building upon the encouraging outcomes of reinforcement learning, we develop a reward model that serves as a critic, providing feedback for the fine-tuned LM's actions and progressively optimizing its policy. During inference, we require the LM to generate solutions iteratively until the repair effect no longer improves or the maximum step limit is reached. The results show that the process-based approach not only outperforms larger outcome-based generation methods, but also nearly matches the performance of closed-source commercial large-scale LMs.
Comment: 15 pages, 13 figures
Published: 2024

13. Characteristic Performance Study on Solving Oscillator ODEs via Soft-constrained Physics-informed Neural Network with Small Data

Author: Lu, Kai-liang, Su, Yu-meng, Bi, Zhuo, Qiu, Cheng, and Zhang, Wen-jun
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning, 68T07, I.5
Abstract: This paper compared physics-informed neural network (PINN), conventional neural network (NN) and traditional numerical discretization methods on solving differential equations (DEs) through literature investigation and experimental validation. We focused on the soft-constrained PINN approach and formalized its mathematical framework and computational flow for solving Ordinary DEs and Partial DEs (ODEs/PDEs). The working mechanism and its accuracy and efficiency were experimentally verified by solving typical linear and non-linear oscillator ODEs. We demonstrate that the DeepXDE-based implementation of PINN is not only lightweight in code and efficient in training, but also flexible across CPU/GPU platforms. PINN greatly reduces the need for labeled data: when the nonlinearity of the ODE is weak, a very small amount of supervised training data plus a few unsupervised collocation points are sufficient to predict the solution; in the minimalist case, only one or two training points (with initial values) are needed for first- or second-order ODEs, respectively. We also find that, with the aid of collocation points and the use of physical information, PINN has the ability to extrapolate data outside the time domain of the training set, and is especially robust to noisy data, and thus has enhanced generalization capabilities. Training is accelerated when the gains obtained along with the reduction in the amount of data outweigh the delay caused by the increase in the loss function terms. The soft-constrained PINN can easily impose a physical law (e.g., conservation of energy) constraint by adding a regularization term to the total loss function, thus improving the solution performance for ODEs that obey this physical law. Furthermore, PINN can also be used for stiff ODEs, PDEs, and other types of DEs, and is becoming a favorable catalyst for the era of Digital Twins.
Comment: 24 pages, 7 figures, 2 tables, etc. Ready for submission
Published: 2024

14. Imagen 3

Author: Imagen-Team-Google, Baldridge, Jason, Bauer, Jakob, Bhutani, Mukul, Brichtova, Nicole, Bunner, Andrew, Chan, Kelvin, Chen, Yichang, Dieleman, Sander, Du, Yuqing, Eaton-Rosen, Zach, Fei, Hongliang, de Freitas, Nando, Gao, Yilin, Gladchenko, Evgeny, Colmenarejo, Sergio Gómez, Guo, Mandy, Haig, Alex, Hawkins, Will, Hu, Hexiang, Huang, Huilian, Igwe, Tobenna Peter, Kaplanis, Christos, Khodadadeh, Siavash, Kim, Yelin, Konyushkova, Ksenia, Langner, Karol, Lau, Eric, Luo, Shixin, Mokrá, Soňa, Nandwani, Henna, Onoe, Yasumasa, Oord, Aäron van den, Parekh, Zarana, Pont-Tuset, Jordi, Qi, Hang, Qian, Rui, Ramachandran, Deepak, Rane, Poorva, Rashwan, Abdullah, Razavi, Ali, Riachi, Robert, Srinivasan, Hansa, Srinivasan, Srivatsan, Strudel, Robin, Uria, Benigno, Wang, Oliver, Wang, Su, Waters, Austin, Wolff, Chris, Wright, Auriel, Xiao, Zhisheng, Xiong, Hao, Xu, Keyang, van Zee, Marc, Zhang, Junlin, Zhang, Katie, Zhou, Wenlei, Zolna, Konrad, Aboubakar, Ola, Akbulut, Canfer, Akerlund, Oscar, Albuquerque, Isabela, Anderson, Nina, Andreetto, Marco, Aroyo, Lora, Bariach, Ben, Barker, David, Ben, Sherry, Berman, Dana, Biles, Courtney, Blok, Irina, Botadra, Pankil, Brennan, Jenny, Brown, Karla, Buckley, John, Bunel, Rudy, Bursztein, Elie, Butterfield, Christina, Caine, Ben, Carpenter, Viral, Casagrande, Norman, Chang, Ming-Wei, Chang, Solomon, Chaudhuri, Shamik, Chen, Tony, Choi, John, Churbanau, Dmitry, Clement, Nathan, Cohen, Matan, Cole, Forrester, Dektiarev, Mikhail, Du, Vincent, Dutta, Praneet, Eccles, Tom, Elue, Ndidi, Feden, Ashley, Fruchter, Shlomi, Garcia, Frankie, Garg, Roopal, Ge, Weina, Ghazy, Ahmed, Gipson, Bryant, Goodman, Andrew, Górny, Dawid, Gowal, Sven, Gupta, Khyatti, Halpern, Yoni, Han, Yena, Hao, Susan, Hayes, Jamie, Hertz, Amir, Hirst, Ed, Hou, Tingbo, Howard, Heidi, Ibrahim, Mohamed, Ike-Njoku, Dirichi, Iljazi, Joana, Ionescu, Vlad, Isaac, William, Jana, Reena, Jennings, Gemma, Jenson, Donovon, Jia, Xuhui, Jones, Kerry, Ju, Xiaoen, Kajic, Ivana, Ayan, Burcu Karagol, Kelly, Jacob, Kothawade, Suraj, Kouridi, Christina, Ktena, Ira, Kumakaw, Jolanda, Kurniawan, Dana, Lagun, Dmitry, Lavitas, Lily, Lee, Jason, Li, Tao, Liang, Marco, Li-Calis, Maggie, Liu, Yuchi, Alberca, Javier Lopez, Lu, Peggy, Lum, Kristian, Ma, Yukun, Malik, Chase, Mellor, John, Mosseri, Inbar, Murray, Tom, Nematzadeh, Aida, Nicholas, Paul, Oliveira, João Gabriel, Ortiz-Jimenez, Guillermo, Paganini, Michela, Paine, Tom Le, Paiss, Roni, Parrish, Alicia, Peckham, Anne, Peswani, Vikas, Petrovski, Igor, Pfaff, Tobias, Pirozhenko, Alex, Poplin, Ryan, Prabhu, Utsav, Qi, Yuan, Rahtz, Matthew, Rashtchian, Cyrus, Rastogi, Charvi, Raul, Amit, Rebuffi, Sylvestre-Alvise, Ricco, Susanna, Riedel, Felix, Robinson, Dirk, Rohatgi, Pankaj, Rosgen, Bill, Rumbley, Sarah, Ryu, Moonkyung, Salgado, Anthony, Singla, Sahil, Schroff, Florian, Schumann, Candice, Shah, Tanmay, Shillingford, Brendan, Shivakumar, Kaushik, Shtatnov, Dennis, Singer, Zach, Sluzhaev, Evgeny, Sokolov, Valerii, Sottiaux, Thibault, Stimberg, Florian, Stone, Brad, Stutz, David, Su, Yu-Chuan, Tabellion, Eric, Tang, Shuai, Tao, David, Thomas, Kurt, Thornton, Gregory, Toor, Andeep, Udrescu, Cristian, Upadhyay, Aayush, Vasconcelos, Cristina, Vasiloff, Alex, Voynov, Andrey, Walker, Amanda, Wang, Luyu, Wang, Miaosen, Wang, Simon, Wang, Stanley, Wang, Qifei, Wang, Yuxiao, Weisz, Ágoston, Wiles, Olivia, Wu, Chenxia, Xu, Xingyu Federico, Xue, Andrew, Yang, Jianbo, Yu, Luo, Yurtoglu, Mete, Zand, Ali, Zhang, Han, Zhang, Jiageng, Zhao, Catherine, Zhaxybay, Adilet, Zhou, Miao, Zhu, Shengqi, Zhu, Zhenkai, Bloxwich, Dawn, Bordbar, Mahyar, Cobo, Luis C., Collins, Eli, Dai, Shengyang, Doshi, Tulsee, Dragan, Anca, Eck, Douglas, Hassabis, Demis, Hsiao, Sissie, Hume, Tom, Kavukcuoglu, Koray, King, Helen, Krawczyk, Jack, Li, Yeqing, Meier-Hellstern, Kathy, Orban, Andras, Pinsky, Yury, Subramanya, Amar, Vinyals, Oriol, Yu, Ting, and Zwols, Yori
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.
Published: 2024

15. VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Author: Liu, Xiao, Zhang, Tianjie, Gu, Yu, Iong, Iat Long, Xu, Yifan, Song, Xixuan, Zhang, Shudan, Lai, Hanyu, Liu, Xinyi, Zhao, Hanlin, Sun, Jiadai, Yang, Xinyue, Yang, Yu, Qi, Zehan, Yao, Shuntian, Sun, Xueqiao, Cheng, Siyi, Zheng, Qinkai, Yu, Hao, Zhang, Hanchen, Hong, Wenyi, Ding, Ming, Pan, Lihang, Gu, Xiaotao, Zeng, Aohan, Du, Zhengxiao, Song, Chan Hee, Su, Yu, Dong, Yuxiao, and Tang, Jie
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across nine proprietary LMM APIs and eight open models, we demonstrate the considerable yet still developing agent capabilities of these models. Additionally, VAB provides a trajectory training set constructed through hybrid methods including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, promoting substantial performance improvements in LMMs through behavior cloning. Our work not only aims to benchmark existing models but also provides a solid foundation for future development into visual foundation agents. Code, train & test data, and part of fine-tuned open LMMs are available at https://github.com/THUDM/VisualAgentBench.
Published: 2024

16. Stability of Quantum Systems beyond Canonical Typicality

Author: Su, Yu, Zhu, Zi-Fan, Wang, Yao, Xu, Rui-Xue, and Yan, YiJing
Subjects: Quantum Physics, Physics - Chemical Physics
Abstract: Involvement of the environment is indispensable for establishing the statistical distribution of system. We analyze the statistical distribution of a quantum system coupled strongly with a heat bath. This distribution is determined by tracing over the bath's degrees of freedom for the equilibrium system-plus-bath composite. The stability of system distribution is largely affected by the system--bath interaction strength. We propose that the quantum system exhibits a stable distribution only when its system response function in the frequency domain satisfies $\tilde\chi(\omega = 0+)>0$. We show our results by investigating the non-interacting bosonic impurity system from both the thermodynamic and dynamic perspectives. Our study refines the theoretical framework of canonical statistics, offering insights into thermodynamic phenomena in small-scale systems., Comment: 5 pages, 4 figures
Published: 2024

17. Memory Kernel Coupling Theory: Obtain Time Correlation Function from Higher-order Moments

Author: Liu, Wei, Su, Yu, Wang, Yao, and Dou, Wenjie
Subjects: Physics - Chemical Physics, Condensed Matter - Statistical Mechanics, Quantum Physics
Abstract: Dynamical observables can often be described by time correlation functions (TCFs). However, efficiently calculating TCFs for complex quantum systems is a significant challenge, which generally requires solving the full dynamics of the systems. This Letter presents the memory kernel coupling theory (MKCT), a general formalism for evaluating TCFs. The MKCT builds upon Mori's memory kernel formalism for TCFs. Our theory further decomposes the memory kernel into auxiliary kernels. Rapid decay of auxiliary kernels allows us to truncate the coupled equations of motion with high accuracy. Notably, only higher-order moments are sufficient as the input for obtaining TCFs. While this formalism is general, we carry out the numerical demonstration for a typical open quantum system--the spin-boson model.
Published: 2024

18. Complex scalar dark matter in a new gauged U(1) symmetry with kinetic and direct mixings

Author: Su, Yu-Hang, Cai, Chengfeng, Zeng, Yu-Pan, and Zhang, Hong-Hao
Subjects: High Energy Physics - Phenomenology, Astrophysics - Cosmology and Nongalactic Astrophysics, High Energy Physics - Experiment
Abstract: We propose a scalar dark matter model featuring a hidden gauge symmetry, denoted as U(1)_X, with two complex scalars, Phi and S. In this framework, Phi spontaneously breaks the U(1)_X gauge symmetry, while S serves as a viable dark matter candidate. Particularly, the kinetic and direct mixings between the U(1)_X and U(1)_Y gauge groups provide a portal between dark matter and the Standard Model particles. These mixings offer a plausible explanation for the W boson mass anomaly observed by the CDF Collaboration. We study the comprehensive phenomenological constraints of this model from colliders and dark matter detection experiments, including Z' searches at the LHC, the 125 GeV Higgs boson measurements, the relic density of dark matter and the indirect detection of dark matter annihilation. By randomly scanning the parameter space, we find that the regions where $m_{Z'}$ > 4750 GeV and $m_{Z'}$ < 4750 GeV for $g_x$ close to 1 remain viable and can be tested by future experiments., Comment: 22 pages, 10 figures
Published: 2024

19. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Author: Wang, Boshi, Yue, Xiang, Su, Yu, and Sun, Huan
Subjects: Computer Science - Computation and Language
Abstract: We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning., Comment: NeurIPS 2024. Code and data: https://github.com/OSU-NLP-Group/GrokkedTransformer
Published: 2024

20. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

Author: Gutiérrez, Bernal Jiménez, Shu, Yiheng, Gu, Yu, Yasunaga, Michihiro, and Su, Yu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In order to thrive in hostile and ever-changing natural environments, mammalian brains evolved to store large amounts of knowledge about the world and continually integrate new information while avoiding catastrophic forgetting. Despite the impressive accomplishments, large language models (LLMs), even with retrieval-augmented generation (RAG), still struggle to efficiently and effectively integrate a large amount of new experiences after pre-training. In this work, we introduce HippoRAG, a novel retrieval framework inspired by the hippocampal indexing theory of human long-term memory to enable deeper and more efficient knowledge integration over new experiences. HippoRAG synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of neocortex and hippocampus in human memory. We compare HippoRAG with existing RAG methods on multi-hop question answering and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%. Single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-30 times cheaper and 6-13 times faster, and integrating HippoRAG into IRCoT brings further substantial gains. Finally, we show that our method can tackle new types of scenarios that are out of reach of existing methods. Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
Published: 2024

21. A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

Author: Kim, Gwanghyun, Martinez, Alonso, Su, Yu-Chuan, Jou, Brendan, Lezama, José, Gupta, Agrim, Yu, Lijun, Jiang, Lu, Jansen, Aren, Walker, Jacob, and Somandepalli, Krishna
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task, which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space. Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: avdit2024.github.io
Published: 2024

22. Reconfigurable Massive MIMO: Precoding Design and Channel Estimation in the Electromagnetic Domain

Author: Ying, Keke, Gao, Zhen, Su, Yu, Qin, Tong, Matthaiou, Michail, and Schober, Robert
Subjects: Computer Science - Information Theory, Electrical Engineering and Systems Science - Signal Processing
Abstract: Reconfigurable massive multiple-input multiple-output (RmMIMO), as an electronically-controlled fluid antenna system, offers increased flexibility for future communication systems by exploiting previously untapped degrees of freedom in the electromagnetic (EM) domain. The representation of the traditional spatial domain channel state information (sCSI) limits the insights into the potential of EM domain channel properties, constraining the base station's (BS) utmost capability for precoding design. This paper leverages the EM domain channel state information (eCSI) for antenna radiation pattern design at the BS. We develop an orthogonal decomposition method based on spherical harmonic functions to decompose the radiation pattern into a linear combination of orthogonal bases. By formulating the radiation pattern design as an optimization problem for the projection coefficients over these bases, we develop a manifold optimization-based method for iterative radiation pattern and digital precoder design. To address the eCSI estimation problem, we capitalize on the inherent structure of the channel. Specifically, we propose a subspace-based scheme to reduce the pilot overhead for wideband sCSI estimation. Given the estimated full-band sCSI, we further employ parameterized methods for angle of arrival estimation. Subsequently, the complete eCSI can be reconstructed after estimating the equivalent channel gain via the least squares method. Simulation results demonstrate that, in comparison to traditional mMIMO systems with fixed antenna radiation patterns, the proposed RmMIMO architecture offers significant throughput gains for multi-user transmission at a low channel estimation overhead., Comment: This work has been accepted by IEEE Transactions on Communications
Published: 2024

23. Quantum Mechanics of Open Systems in Non-Inertial Motion

Author: Zhu, Zi-Fan, Su, Yu, Wang, Yao, Xu, Rui-Xue, and Yan, YiJing
Subjects: Quantum Physics, Physics - Atomic Physics
Abstract: The study of quantum mechanics in non-inertial reference frames, particularly in the context of open systems, introduces several intriguing phenomena and challenges. This paper presents a comprehensive framework for analyzing the quantum mechanics of open systems undergoing non-inertial motion. Our methodology leverages the concept of dissipatons, statistical quasi-particles that capture collective dissipative effects from the environment. We demonstrate that our approach offers a natural understanding of the intricate dynamics among non-inertial effects, decoherence, dissipation, and system-bath entanglement. Specifically, we conduct demonstrations focusing on the Lamb shift phenomenon within a rotating ring cavity. Through theoretical exposition and practical applications, our framework elucidates the profound interplay between open quantum dynamics and non-inertial motion, paving the way for advancements in quantum information processing and sensing technologies., Comment: 7 pages, 1 figure
Published: 2024

24. Spin-lattice relaxation with non-linear couplings: Comparison between Fermi's golden rule and extended dissipaton equation of motion

Author: Bi, Rui-Hao, Su, Yu, Wang, Yao, Sun, Lei, and Dou, Wenjie
Subjects: Physics - Chemical Physics, Quantum Physics
Abstract: Fermi's golden rule (FGR) offers an empirical framework for understanding the dynamics of spin-lattice relaxation in magnetic molecules, encompassing mechanisms like direct (one-phonon) and Raman (two-phonon) processes. These principles effectively model experimental longitudinal relaxation rates, denoted as $T_1^{-1}$. However, under scenarios of increased coupling strength and nonlinear spin-lattice interactions, FGR's applicability may diminish. This paper numerically evaluates the exact spin-lattice relaxation rate kernels, employing the extended dissipaton equation of motion (DEOM) formalism. Our calculations reveal that when quadratic spin-lattice coupling is considered, the rate kernels exhibit a free induction decay-like feature, and the damping rates depend on the interaction strength. We observe that the temperature dependence predicted by FGR significantly deviates from the exact results since FGR ignores the non-Markovian nature of spin-lattice relaxation. Our methods can be readily applied to other systems with nonlinear spin-lattice interactions and provide valuable insights into the temperature dependence of $T_1$ in molecular qubits., Comment: 10 pages, 5 figures
Published: 2024
Full Text: View/download PDF

25. MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

Author: Zhang, Kai, Luan, Yi, Hu, Hexiang, Lee, Kenton, Qiao, Siyuan, Chen, Wenhu, Su, Yu, and Chang, Ming-Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/., Comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
Published: 2024

26. Knowledge and Data Dual-Driven Channel Estimation and Feedback for Ultra-Massive MIMO Systems under Hybrid Field Beam Squint Effect

Author: Wang, Kuiyu, Gao, Zhen, Chen, Sheng, Ning, Boyu, Chen, Gaojie, Su, Yu, Wang, Zhaocheng, and Poor, H. Vincent
Subjects: Computer Science - Information Theory, Electrical Engineering and Systems Science - Signal Processing
Abstract: Acquiring accurate channel state information (CSI) at an access point (AP) is challenging for wideband millimeter wave (mmWave) ultra-massive multiple-input and multiple-output (UMMIMO) systems, due to the high-dimensional channel matrices, hybrid near- and far-field channel features, beam squint effects, and imperfect hardware constraints, such as low-resolution analog-to-digital converters, and in-phase and quadrature imbalance. To overcome these challenges, this paper proposes an efficient downlink channel estimation (CE) and CSI feedback approach based on knowledge and data dual-driven deep learning (DL) networks. Specifically, we first propose a data-driven residual neural network de-quantizer (ResNet-DQ) to pre-process the received pilot signals at user equipment (UEs), where the noise and distortion brought by imperfect hardware can be mitigated. A knowledge-driven generalized multiple measurement vector learned approximate message passing (GMMV-LAMP) network is then developed to jointly estimate the channels by exploiting the approximately same physical angle shared by different subcarriers. In particular, two wideband redundant dictionaries (WRDs) are proposed such that the measurement matrices of the GMMV-LAMP network can accommodate the far-field and near-field beam squint effect, respectively. Finally, we propose an encoder at the UEs and a decoder at the AP by a data-driven CSI residual network (CSI-ResNet) to compress the CSI matrix into a low-dimensional quantized bit vector for feedback, thereby reducing the feedback overhead substantially. Simulation results show that the proposed knowledge and data dual-driven approach outperforms conventional downlink CE and CSI feedback methods, especially in the case of low signal-to-noise ratios., Comment: 17 pages, 22 figures, 3 tables
Published: 2024

27. LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

Author: Wang, Boshi, Fang, Hao, Eisner, Jason, Van Durme, Benjamin, and Su, Yu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy., Comment: Code and data available at https://github.com/microsoft/simulated-trial-and-error
Published: 2024

28. Authors’ Reply to: Methodological Clarifications and Generalizing From Weibo Data. Comment on 'Nature and Diffusion of COVID-19–related Oral Health Information on Chinese Social Media: Analysis of Tweets on Weibo'

Author: Tao, Zhuo-Ying and Su, Yu-Xiong
Subjects: Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
Published: 2021
Full Text: View/download PDF

29. Nature and Diffusion of COVID-19–related Oral Health Information on Chinese Social Media: Analysis of Tweets on Weibo

Author: Tao, Zhuo-Ying, Chu, Guang, McGrath, Colman, Hua, Fang, Leung, Yiu Yan, Yang, Wei-Fa, and Su, Yu-Xiong
Subjects: Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
Abstract: Background: Social media has become increasingly important as a source of information for the public and is widely used for health-related information. The outbreak of the coronavirus disease (COVID-19) has exerted a negative impact on dental practices. Objective: The aim of this study is to analyze the nature and diffusion of COVID-19–related oral health information on the Chinese social media site Weibo. Methods: A total of 15,900 tweets related to oral health and dentistry information from Weibo during the COVID-19 outbreak in China (December 31, 2019, to March 16, 2020) were included in our study. Two researchers coded 1000 of the total tweets in advance, and two main thematic categories with eight subtypes were refined. The included tweets were analyzed over time and geographic region, and coded into eight thematic categories. Additionally, the time distributions of tweets containing information about dental services, needs of dental treatment, and home oral care during the COVID-19 epidemic were further analyzed. Results: People reacted rapidly to the emerging severe acute respiratory syndrome coronavirus 2 threat to dental services, and a large amount of COVID-19–related oral health information was tweeted on Weibo. The time and geographic distribution of tweets shared similarities with epidemiological data of the COVID-19 outbreak in China. Tweets containing home oral care and dental services content were the most frequently exchanged information (n=4803/15,900, 30.20% and n=4478, 28.16%, respectively). Significant differences of public attention were found between various types of bloggers in dental services–related tweets (P
Published: 2020
Full Text: View/download PDF

30. Gait-Based Privacy Protection for Smart Wearable Devices

Author: Su, Yu, Li, Yongjiao, and Cao, Zhu
Subjects: Computer Science - Cryptography and Security
Abstract: Smart wearable devices (SWDs) collect and store sensitive daily information of many people. Their primary method of identification is still the password unlocking method. However, several studies have shown serious security flaws in that method, which makes the privacy and security concerns of SWDs particularly urgent. Gait identification is well suited for SWDs because their built-in sensors can provide data support for identification. However, existing gait identification methods have low accuracy and neglect to protect the privacy of gait features. In addition, an SWD can be used as an internet of things device for users to share data, but few studies have used gait feature-based encryption schemes to protect the privacy of message interactions between SWDs and other devices. In this paper, we propose a gait identification network, a bi-directional long short-term memory network with an attention mechanism (ABLSTM), to improve the identification accuracy and a stochastic orthogonal transformation (SOT) scheme to protect the extracted gait features from leakage. In the experiments, ABLSTM achieves an accuracy of 95.28%, reducing the previous error rate by 19.3%. The SOT scheme is proved to be resistant to the chosen plaintext attack (CPA) and is 30% faster than previous methods. A biometric-based encryption scheme is proposed to enable secure message interactions using gait features as keys after the gait identification stage is passed, and offers better protection of the gait features compared to previous schemes., Comment: 13 pages, 12 figures
Published: 2024
Full Text: View/download PDF

31. Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments

Author: Gu, Yu, Shu, Yiheng, Yu, Hao, Liu, Xiao, Dong, Yuxiao, Tang, Jie, Srinivasa, Jayanth, Latapie, Hugo, and Su, Yu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, I.2.7
Abstract: The applications of large language models (LLMs) have expanded well beyond the confines of text processing, signaling a new era where LLMs are envisioned as generalist agents capable of operating within complex environments. These environments are often highly expansive, making it impossible for the LLM to process them within its short-term memory. Motivated by recent research on extending the capabilities of LLMs with tools, we seek to investigate the intriguing potential of tools to augment LLMs in handling such complexity by introducing a novel class of tools, termed middleware, to aid in the proactive exploration within these massive environments. Such specialized tools can serve as a middleware layer shielding the LLM from environmental complexity. In two representative complex environments -- knowledge bases (KBs) and databases -- we demonstrate the significant potential of augmenting language agents with tools in complex environments. Notably, equipped with the middleware, GPT-4 achieves 2.8X the performance of the best baseline in tasks requiring access to database content and 2.2X in KB tasks. Our findings illuminate the path for advancing language agents in real-world applications., Comment: EMNLP'2024; 18 pages, 8 figures, 8 tables
Published: 2024

32. When is Tree Search Useful for LLM Planning? It Depends on the Discriminator

Author: Chen, Ziru, White, Michael, Mooney, Raymond, Payani, Ali, Su, Yu, and Sun, Huan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10--20 times slower but leads to negligible performance gains, which hinders its real-world applications. Code and data are available at https://github.com/OSU-NLP-Group/llm-planning-eval., Comment: ACL 2024 main
Published: 2024

33. A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

Author: Mo, Lingbo, Liao, Zeyi, Zheng, Boyuan, Su, Yu, Xiao, Chaowei, and Sun, Huan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Language agents powered by large language models (LLMs) have seen exploding development. Their capability of using language as a vehicle for thought and communication lends an incredible level of flexibility and versatility. People have quickly capitalized on this capability to connect LLMs to a wide range of external components and environments: databases, tools, the Internet, robotic embodiment, etc. Many believe an unprecedentedly powerful automation technology is emerging. However, new automation technologies come with new safety risks, especially for intricate systems like language agents. There is a surprisingly large gap between the speed and scale of their development and deployment and our understanding of their safety risks. Are we building a house of cards? In this position paper, we present the first systematic effort in mapping adversarial attacks against language agents. We first present a unified conceptual framework for agents with three major components: Perception, Brain, and Action. Under this framework, we present a comprehensive discussion and propose 12 potential attack scenarios against different components of an agent, covering different attack strategies (e.g., input manipulation, adversarial demonstrations, jailbreaking, backdoors). We also draw connections to successful attack strategies previously applied to LLMs. We emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.
Published: 2024

34. Dual-View Visual Contextualization for Web Navigation

Author: Kil, Jihyung, Song, Chan Hee, Zheng, Boyuan, Deng, Xiang, Su, Yu, and Chao, Wei-Lun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, making it hard to select the right (sequence of) actions. In this paper, we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. We build upon the insight -- web developers tend to arrange task-related elements nearby on webpages to enhance user experiences -- and propose to contextualize each element with its neighbor elements, using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios, including cross-task, cross-website, and cross-domain ones., Comment: Accepted to CVPR 2024
Published: 2024

35. TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Author: Xie, Jian, Zhang, Kai, Chen, Jiangjie, Zhu, Tinghui, Lou, Renze, Tian, Yuandong, Xiao, Yanghua, and Su, Yu
Subjects: Computer Science - Computation and Language
Abstract: Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks -- even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents., Comment: ICML 2024 (Spotlight)
Published: 2024

36. Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning

Author: Zhu, Tinghui, Zhang, Kai, Xie, Jian, and Su, Yu
Subjects: Computer Science - Computation and Language
Abstract: Recent advancements have significantly augmented the reasoning capabilities of Large Language Models (LLMs) through various methodologies, especially chain-of-thought (CoT) reasoning. However, previous methods fail to address reasoning errors in intermediate steps, leading to accumulative errors. In this paper, we propose Deductive Beam Search (DBS), which seamlessly integrates CoT and deductive reasoning with step-wise beam search for LLMs. Our approach deploys a verifier, verifying the deducibility of a reasoning step and its premises, thus alleviating the error accumulation. Furthermore, we introduce a scalable and labor-free data construction method to amplify our model's verification capabilities. Extensive experiments demonstrate that our approach significantly enhances the base performance of LLMs of various scales (7B, 13B, 70B, and ChatGPT) across 8 reasoning datasets from 3 diverse reasoning genres, including arithmetic, commonsense, and symbolic. Moreover, our analysis proves DBS's capability of detecting diverse and subtle reasoning errors and robustness on different model scales., Comment: COLM 2024
Published: 2024

37. One-step implementation of nonadiabatic geometric fSim gate in superconducting circuits

Author: Yun, M. -R., Shan, Zheng, Sun, Li-Li, Yan, L. -L., Jia, Yu, Su, S. -L., and Chen, G.
Subjects: Quantum Physics
Abstract: Due to their significant application in reducing algorithm depth, fSim gates have attracted a lot of attention. However, during the implementation of quantum gates, fluctuations in control parameters and decoherence caused by the environment may lead to a decrease in the fidelity of the gate. Implementing the fSim gate that is robust to these factors in one step remains an unresolved issue. In this manuscript, we propose a one-step implementation of the nonadiabatic geometric fSim gate composed of a nonadiabatic holonomic controlled phase (CP) gate and a nonadiabatic noncyclic geometric iSWAP gate with parallel paths in a tunable superconducting circuit. Compared to the composite nonadiabatic geometric fSim gate composed of a nonadiabatic holonomic CP gate and a nonadiabatic geometric iSWAP gate, our scheme only takes half the time and demonstrates robustness to parameter fluctuations, as well as to environmental impacts. Moreover, the scheme does not require complex controls, making it very easy to implement in experiments, and can be achieved in various circuit structures. Our scheme may provide a promising path toward quantum computation and simulation.
Published: 2024

38. GPT-4V(ision) is a Generalist Web Agent, if Grounded

Author: Zheng, Boyuan, Gou, Boyu, Kil, Jihyung, Sun, Huan, and Su, Yu
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out not to be effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.
Published: 2024

39. Instruct-Imagen: Image Generation with Multi-modal Instruction

Author: Hu, Hexiang, Chan, Kelvin C. K., Su, Yu-Chuan, Chen, Wenhu, Li, Yandong, Sohn, Kihyuk, Zhao, Yang, Ben, Xue, Gong, Boqing, Cohen, William, Chang, Ming-Wei, and Jia, Xuhui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's capabilities to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks., Comment: 20 pages, 18 figures
Published: 2024

40. Reviving the Context: Camera Trap Species Classification as Link Prediction on Multimodal Knowledge Graphs

Author: Pahuja, Vardaan, Luo, Weidi, Gu, Yu, Tu, Cheng-Hao, Chen, Hong-You, Berger-Wolf, Tanya, Stewart, Charles, Gao, Song, Chao, Wei-Lun, and Su, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Camera traps are important tools in animal ecology for biodiversity monitoring and conservation. However, their practical application is limited by issues such as poor generalization to new and unseen locations. Images are typically associated with diverse forms of context, which may exist in different modalities. In this work, we exploit the structured context linked to camera trap images to boost out-of-distribution generalization for species classification tasks in camera traps. For instance, a picture of a wild animal could be linked to details about the time and place it was captured, as well as structured biological knowledge about the animal species. While often overlooked by existing studies, incorporating such context offers several potential benefits for better image understanding, such as addressing data scarcity and enhancing generalization. However, effectively incorporating such heterogeneous context into the visual domain is a challenging problem. To address this, we propose a novel framework that transforms species classification as link prediction in a multimodal knowledge graph (KG). This framework enables the seamless integration of diverse multimodal contexts for visual recognition. We apply this framework for out-of-distribution species classification on the iWildCam2020-WILDS and Snapshot Mountain Zebra datasets and achieve competitive performance with state-of-the-art approaches. Furthermore, our framework enhances sample efficiency for recognizing under-represented species., Comment: 12 pages, 5 figures
Published: 2023
Full Text: View/download PDF

41. Generalized system-bath entanglement theorem for Gaussian environments

Author: Su, Yu, Wang, Yao, Xu, Rui-Xue, and Yan, YiJing
Subjects: Quantum Physics, Physics - Chemical Physics
Abstract: A system-bath entanglement theorem (SBET) with Gaussian environments was established previously in J. Chem. Phys. 152, 034102 (2020) in terms of linear response functions. This theorem connects the system-bath entanglement responses to the local system and bare bath ones. In this work, we generalize it to correlation functions. Key steps in derivation are the generalized Langevin dynamics for the hybridizing bath modes as in the previous work, together with the Bogoliubov transformation mapping the original finite-temperature canonical reservoir to an effective zero-temperature vacuum via an auxiliary bath. With the theorem, the system-bath entangled correlations and the bath modes correlations in the full composite space can be evaluated as long as the bare-bath statistical properties are known and the reduced system correlations are obtained. Numerical demonstrations are carried out for the evaluation of the solvation free energy of an electron transfer system with certain intramolecular vibrational modes., Comment: 9 pages, 3 figures
Published: 2023

42. Fine-grained Controllable Video Generation via Object Appearance and Context

Author: Huang, Hsin-Ping, Su, Yu-Chuan, Sun, Deqing, Jiang, Lu, Jia, Xuhui, Zhu, Yukun, and Yang, Ming-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines., Comment: Project page: https://hhsinping.github.io/factor
Published: 2023

43. MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

Author: Lou, Renze, Zhang, Kai, Xie, Jian, Sun, Yuxuan, Ahn, Janice, Xu, Hanzi, Su, Yu, and Yin, Wenpeng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes., Comment: ICLR 2024. Data, model, and code are available at: https://renzelou.github.io/Muffin/
Published: 2023

44. Genetic risk impacts the association of menopausal hormone therapy with colorectal cancer risk

Author: Tian, Yu, Lin, Yi, Qu, Conghui, Arndt, Volker, Baurley, James W., Berndt, Sonja I., Bien, Stephanie A., Bishop, D. Timothy, Brenner, Hermann, Buchanan, Daniel D., Budiarto, Arif, Campbell, Peter T., Carreras-Torres, Robert, Casey, Graham, Chan, Andrew T., Chen, Rui, Chen, Xuechen, Conti, David V., Díez-Obrero, Virginia, Dimou, Niki, Drew, David A., Figueiredo, Jane C., Gallinger, Steven, Giles, Graham G., Gruber, Stephen B., Gunter, Marc J., Harlid, Sophia, Harrison, Tabitha A., Hidaka, Akihisa, Hoffmeister, Michael, Huyghe, Jeroen R., Jenkins, Mark A., Jordahl, Kristina M., Joshi, Amit D., Keku, Temitope O., Kawaguchi, Eric, Kim, Andre E., Kundaje, Anshul, Larsson, Susanna C., Marchand, Loic Le, Lewinger, Juan Pablo, Li, Li, Moreno, Victor, Morrison, John, Murphy, Neil, Nan, Hongmei, Nassir, Rami, Newcomb, Polly A., Obón-Santacana, Mireia, Ogino, Shuji, Ose, Jennifer, Pardamean, Bens, Pellatt, Andrew J., Peoples, Anita R., Platz, Elizabeth A., Potter, John D., Prentice, Ross L., Rennert, Gad, Ruiz-Narvaez, Edward A., Sakoda, Lori C., Schoen, Robert E., Shcherbina, Anna, Stern, Mariana C., Su, Yu-Ru, Thibodeau, Stephen N., Thomas, Duncan C., Tsilidis, Konstantinos K., van Duijnhoven, Franzel J. B., Van Guelpen, Bethany, Visvanathan, Kala, White, Emily, Wolk, Alicja, Woods, Michael O., Wu, Anna H., Peters, Ulrike, Gauderman, W. James, Hsu, Li, and Chang-Claude, Jenny
Published: 2024
Full Text: View/download PDF

45. BioCLIP: A Vision Foundation Model for the Tree of Life

Author: Stevens, Samuel, Wu, Jiaman, Thompson, Matthew J, Campolongo, Elizabeth G, Song, Chan Hee, Carlyn, David Edward, Dong, Li, Dahdul, Wasila M, Stewart, Charles, Berger-Wolf, Tanya, Chao, Wei-Lun, and Su, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. https://imageomics.github.io/bioclip has models, data and code., Comment: CVPR 2024 (oral) camera-ready version; data released
Published: 2023

46. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Author: Yue, Xiang, Ni, Yuansheng, Zhang, Kai, Zheng, Tianyu, Liu, Ruoqi, Zhang, Ge, Stevens, Samuel, Jiang, Dongfu, Ren, Weiming, Sun, Yuxuan, Wei, Cong, Yu, Botao, Yuan, Ruibin, Sun, Renliang, Yin, Ming, Zheng, Boyuan, Yang, Zhenzhu, Liu, Yibo, Huang, Wenhao, Sun, Huan, Su, Yu, and Chen, Wenhu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence., Comment: CVPR 2024 Oral
Published: 2023

47. A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

Author: Paul, Dipanjyoti, Chowdhury, Arpita, Xiong, Xinqi, Chang, Feng-Ju, Carlyn, David, Stevens, Samuel, Provost, Kaiya L., Karpatne, Anuj, Carstens, Bryan, Rubenstein, Daniel, Stewart, Charles, Berger-Wolf, Tanya, Su, Yu, and Chao, Wei-Lun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn "class-specific" queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via "multi-head" cross-attention, INTR could identify different "attributes" of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained models are publicly accessible at the Imageomics Institute GitHub site: https://github.com/Imageomics/INTR., Comment: Accepted to International Conference on Learning Representations 2024 (ICLR 2024)
Published: 2023

48. Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

Author: Tu, Cheng-Hao, Chen, Hong-You, Mai, Zheda, Zhong, Jike, Pahuja, Vardaan, Berger-Wolf, Tanya, Gao, Song, Stewart, Charles, Su, Yu, and Chao, Wei-Lun
Subjects: Computer Science - Machine Learning
Abstract: We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma -- on the one hand, adapting to the new target domain is important to claim better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data., Comment: Accepted to NeurIPS 2023 main track
Published: 2023

49. MAAIG: Motion Analysis And Instruction Generation

Author: Yeh, Wei-Hsin, Lin, Pei Hsin, Su, Yu-An, Cheng, Wen Hsiang, and Ku, Lun-Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition, I.2.10, I.2.7
Abstract: Many people engage in self-directed sports training at home but lack the real-time guidance of professional coaches, making them susceptible to injuries or the development of incorrect habits. In this paper, we propose a novel application framework called MAAIG (Motion Analysis And Instruction Generation). It can generate embedding vectors for each frame based on user-provided sports action videos. These embedding vectors are associated with the 3D skeleton of each frame and are further input into a pretrained T5 model. Ultimately, our model utilizes this information to generate specific sports instructions. It has the capability to identify potential issues and provide real-time guidance in a manner akin to professional coaches, helping users improve their sports skills and avoid injuries., Comment: Accepted to the ACM Multimedia Asia 2023 Workshop on Intelligent Sports Technologies (WIST)
Published: 2023
Full Text: View/download PDF

50. FLEE-GNN: A Federated Learning System for Edge-Enhanced Graph Neural Network in Analyzing Geospatial Resilience of Multicommodity Food Flows

Author: Qu, Yuxiao, Rao, Jinmeng, Gao, Song, Zhang, Qianheng, Chao, Wei-Lun, Su, Yu, Miller, Michelle, Morales, Alfonso, and Huber, Patrick
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Social and Information Networks, I.2
Abstract: Understanding and measuring the resilience of food supply networks is a global imperative to tackle increasing food insecurity. However, the complexity of these networks, with their multidimensional interactions and decisions, presents significant challenges. This paper proposes FLEE-GNN, a novel Federated Learning System for Edge-Enhanced Graph Neural Network, designed to overcome these challenges and enhance the analysis of geospatial resilience of multicommodity food flow networks, which are one type of spatial network. FLEE-GNN addresses the limitations of current methodologies, such as entropy-based methods, in terms of generalizability, scalability, and data privacy. It combines the robustness and adaptability of graph neural networks with the privacy-conscious and decentralized aspects of federated learning on food supply network resilience analysis across geographical regions. This paper also discusses FLEE-GNN's innovative data generation techniques, experimental designs, and future directions for improvement. The results show the advancements of this approach to quantifying the resilience of multicommodity food flow networks, contributing to efforts towards ensuring global food security using AI methods. The developed FLEE-GNN has the potential to be applied in other spatial networks with spatially heterogeneous sub-network distributions., Comment: 10 pages, 5 figures
Published: 2023
Full Text: View/download PDF
