Author: "Lin, Xudong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lin, Xudong"' showing total 569 results

1. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Author: Lee, Jinhyuk, Chen, Anthony, Dai, Zhuyun, Dua, Dheeru, Sachan, Devendra Singh, Boratko, Michael, Luan, Yi, Arnold, Sébastien M. R., Perot, Vincent, Dalmia, Siddharth, Hu, Hexiang, Lin, Xudong, Pasupat, Panupong, Amini, Aida, Cole, Jeremy R., Riedel, Sebastian, Naim, Iftekhar, Chang, Ming-Wei, and Guu, Kelvin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale., Comment: 29 pages. Dataset available at https://github.com/google-deepmind/loft
Published: 2024

2. Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Author: Su, Hung-Ting, Chao, Chun-Tong, Hsu, Ya-Ching, Lin, Xudong, Niu, Yulei, Lee, Hung-Yi, and Hsu, Winston H.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM, Comment: Project page: https://ander1119.github.io/TiM
Published: 2024

3. BLINK: Multimodal Large Language Models Can See but Not Perceive

Author: Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A., Ma, Wei-Chiu, and Krishna, Ranjay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception., Comment: Multimodal Benchmark, Project Url: https://zeyofu.github.io/blink/, ECCV 2024
Published: 2024

4. SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Author: Niu, Yulei, Guo, Wenliang, Chen, Long, Lin, Xudong, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leveraged the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations., Comment: Accepted by ICLR 2024
Published: 2024

5. Video Summarization: Towards Entity-Aware Captions

Author: Ayyubi, Hammad A., Liu, Tianqi, Nagrani, Arsha, Lin, Xudong, Zhang, Mingda, Arnab, Anurag, Han, Feng, Zhu, Yukun, Liu, Jialu, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.
Published: 2023

6. InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Author: Han, Xiaotian, You, Quanzeng, Liu, Yongfei, Chen, Wentao, Zheng, Huangjie, Mrini, Khalil, Lin, Xudong, Wang, Yiqi, Zhai, Bohan, Yuan, Jianbo, Wang, Heng, and Yang, Hongxia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/
Published: 2023

7. Non-Sequential Graph Script Induction via Multimedia Grounding

Author: Zhou, Yu, Li, Sha, Li, Manling, Lin, Xudong, Chang, Shih-Fu, Bansal, Mohit, and Ji, Heng
Subjects: Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility displayed by people executing tasks in real life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose the new challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to WikiHow textual steps and thus transform each video into an observed step path on the latent ground truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute gains on F1@3 for next step prediction and 13.8% absolute gains on Acc@1 for partial sequence completion. Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.
Published: 2023

8. Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Author: Su, Hung-Ting, Niu, Yulei, Lin, Xudong, Hsu, Winston H., and Chang, Shih-Fu
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-trained question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, QG models only learn to ask association questions (e.g., "what is someone doing...") and result in inferior performance due to the poor transfer of association knowledge to CVidQA, which focuses on causal questions like "why is someone doing ...". Observing this, we proposed to exploit causal knowledge to generate question-answer pairs, and proposed a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), leveraging causal commonsense knowledge from language models to tackle CVidQA. To extract knowledge from LMs, CaKE-LM generates causal questions containing two events with one triggering another (e.g., "score a goal" triggers "soccer player kicking ball") by prompting LM with the action (soccer player kicking ball) to retrieve the intention (to score a goal). CaKE-LM significantly outperforms conventional methods by 4% to 6% of zero-shot CVidQA accuracy on NExT-QA and Causal-VidQA datasets. We also conduct comprehensive analyses and provide key findings for future research., Comment: CVPR 2023 Workshop L3D-IVU
Published: 2023

9. Supervised Masked Knowledge Distillation for Few-Shot Transformers

Author: Lin, Han, Han, Guangxing, Ma, Jiawei, Huang, Shiyuan, Lin, Xudong, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision Transformers (ViTs) emerge to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled data, ViT tends to overfit and suffers from severe performance degradation due to its absence of CNN-like inductive bias. Previous works in FSL avoid this problem either through the help of self-supervised auxiliary losses, or through the dexterous use of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch tokens reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method with simple design outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD., Comment: To appear in CVPR 2023
Published: 2023

10. In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Author: Lu, Andrew, Lin, Xudong, Niu, Yulei, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding event relationships in videos requires a model to understand the underlying structures of events (i.e. the event type, the associated argument roles, and corresponding entities) and factual knowledge for reasoning. Structural symbolic representation (SSR) based methods directly take event types and associated argument roles/entities as inputs to perform reasoning. However, the state-of-the-art video event-relation prediction system shows the necessity of using continuous feature vectors from input videos; existing methods based solely on SSR inputs fail completely, even when given oracle event types and argument roles. In this paper, we conduct an extensive empirical analysis to answer the following questions: 1) why SSR-based method failed; 2) how to understand the evaluation setting of video event relation prediction properly; 3) how to uncover the potential of SSR-based methods. We first identify suboptimal training settings as causing the failure of previous SSR-based video event prediction models. Then through qualitative and quantitative analysis, we show how evaluation that takes only video as inputs is currently unfeasible, as well as the reliance on oracle event information to obtain an accurate evaluation. Based on these findings, we propose to further contextualize the SSR-based model to an Event-Sequence Model and equip it with more factual knowledge through a simple yet effective way of reformulating external visual commonsense knowledge bases into an event-relation prediction pretraining dataset. The resultant new state-of-the-art model eventually establishes a 25% Macro-accuracy performance boost., Comment: CVPRW 23, Learning with Limited Labelled Data
Published: 2023

11. TempCLR: Temporal Alignment Representation with Contrastive Learning

Author: Yang, Yuncong, Ma, Jiawei, Huang, Shiyuan, Chen, Long, Lin, Xudong, Han, Guangxing, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. Then, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gain over all three tasks. Detailed ablation studies are provided to justify the approach design., Comment: ICLR 2023 Camera Ready. Code Link: https://github.com/yyuncong/TempCLR
Published: 2022

12. Video Event Extraction via Tracking Visual States of Arguments

Author: Yang, Guang, Li, Manling, Zhang, Jiajie, Lin, Xudong, Chang, Shih-Fu, and Ji, Heng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition.
Published: 2022

13. Weakly-Supervised Temporal Article Grounding

Author: Chen, Long, Niu, Yulei, Chen, Brian, Lin, Xudong, Han, Guangxing, Thomas, Christopher, Ayyubi, Hammad, Ji, Heng, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. 2) All query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptions make today's VG models fail to work in practice. For example, in real-world multimodal assets (e.g., news articles), most of the sentences in the article cannot be grounded in their affiliated videos, and they typically have rich hierarchical relations (i.e., at different semantic scales). To this end, we propose a new challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences to the video, and these sentences are possibly at different semantic scales. Accordingly, we collect the first WSAG dataset to facilitate this task: YouwikiHow, which borrows the inherent multi-scale descriptions in wikiHow articles and plentiful YouTube videos. In addition, we propose a simple but effective method DualMIL for WSAG, which consists of a two-level MIL loss and a single-/cross- sentence constraint loss. These training objectives are carefully designed for these relaxed assumptions. Extensive ablations have verified the effectiveness of DualMIL., Comment: EMNLP 2022, https://github.com/zjuchenlong/WSAG
Published: 2022

14. Learning to Decompose Visual Features with Latent Textual Prompts

Author: Wang, Feng, Li, Manling, Lin, Xudong, Lv, Hairong, Schwing, Alexander G., and Ji, Heng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness in the case of inaccurate text descriptions during retrieval-based inference (the challenge for zero-shot protocol); or 2) breaking the well-established vision-language alignment (the challenge for linear probing). To address them, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable size of language inputs. Our empirical study shows DeFo's significance in improving the vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of both the vision and language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming state-of-the-art vision-language prompt tuning method by 7.6%.
Published: 2022

15. Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

Author: Ayyubi, Hammad A., Thomas, Christopher, Chum, Lovish, Lokesh, Rahul, Chen, Long, Niu, Yulei, Lin, Xudong, Feng, Xuande, Koo, Jaywon, Ray, Sounak, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus, on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and highlight opportunities for future research., Comment: AAAI 2024
Published: 2022

16. Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Author: Lin, Xudong, Tiwari, Simran, Huang, Shiyuan, Li, Manling, Shou, Mike Zheng, Ji, Heng, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have been extensively studied recently for their strong ability of producing discriminative sentence embeddings, e.g., SimCSE. However, there is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform state-of-the-art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All the empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence., Comment: To appear in CVPR 2023; The code will be released at https://github.com/XudongLinthu/upgradable-multimodal-intelligence
Published: 2022

17. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Author: Wang, Zhenhailong, Li, Manling, Xu, Ruochen, Zhou, Luowei, Lei, Jie, Lin, Xudong, Wang, Shuohang, Yang, Ziyi, Zhu, Chenguang, Hoiem, Derek, Chang, Shih-Fu, Bansal, Mohit, and Ji, Heng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets. We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. Especially, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL .
Published: 2022

18. Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

Author: Cai, Guanyu, Ge, Yixiao, Zhang, Binjie, Wang, Alex Jinpeng, Yan, Rui, Lin, Xudong, Shan, Ying, He, Lianghua, Qie, Xiaohu, Wu, Jianping, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research becomes extremely expensive with the need for massive data and a long training time, preventing further explorations. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnections between pre-extracted region features and text. Extensive results of downstream video-language retrieval tasks on four datasets demonstrate the superiority of our method in both effectiveness and efficiency, e.g., our method achieves competitive results with 80% less data and 85% less pre-training time compared to the most efficient VLP method so far (Lei et al., 2021). The code will be available at https://github.com/showlab/DemoVLP.
Published: 2022

19. All in One: Exploring Unified Video-Language Pre-training

Author: Wang, Alex Jinpeng, Ge, Yixiao, Yan, Rui, Ge, Yuying, Lin, Xudong, Cai, Guanyu, Wu, Jianping, Shan, Ying, Qie, Xiaohu, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Mainstream Video-Language Pre-training models (e.g., ActBERT, ClipBERT, VIOLET) consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by using heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data turns out to be a key barrier hindering the design of a modality-agnostic Transformer. To overcome the challenge, we introduce a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner. The careful design enables the representation learning of both video-text multimodal inputs and unimodal inputs using a unified backbone model. Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning. State-of-the-art performance with minimal model FLOPs on nine datasets demonstrates the superiority of our method compared to competitive counterparts. The code and pretrained model have been released at https://github.com/showlab/all-in-one., Comment: 18 pages. 11 figures. Code: https://github.com/showlab/all-in-one
Published: 2022

20. Learning To Recognize Procedural Activities with Distant Supervision

Author: Lin, Xudong, Petroni, Fabio, Bertasius, Gedas, Rohrbach, Marcus, Chang, Shih-Fu, and Torresani, Lorenzo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification., Comment: CVPR 2022. Code will be released here https://github.com/facebookresearch/video-distant-supervision
Published: 2022

21. CLIP-Event: Connecting Text and Images with Event Structures

Author: Li, Manling, Xu, Ruochen, Wang, Shuohang, Zhou, Luowei, Lin, Xudong, Zhu, Chenguang, Zeng, Michael, Ji, Heng, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding objects in images or entities in text, they often ignore the alignment at the level of events and their argument structures. In this work, we propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument (participant) roles. To achieve this, we take advantage of text information extraction technologies to obtain event structural knowledge, and utilize multiple prompt functions to contrast difficult negative descriptions by manipulating event structures. We also design an event graph alignment loss based on optimal transport to capture event argument structures. In addition, we collect a large event-rich dataset (106,875 images) for pretraining, which provides a more challenging image retrieval benchmark to assess the understanding of complicated lengthy sentences. Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction on Multimedia Event Extraction, achieving more than 5% absolute F-score gain in event extraction, as well as significant improvements on a variety of downstream tasks under zero-shot settings.
Published: 2022

22. Analysis and reduction of back-reflection straylight in laser ranging system

Author: Zhou, Lixiang, He, Zhizhao, Li, Hui, Ye, Shaowei, Zhou, Chengkai, Han, Xida, Wu, Xianlin, Lin, Xudong, and Li, Ming
Published: 2024
Full Text: View/download PDF

23. Exploring hedging potentials of green bonds against oil price shocks: Evidence from quantile-on-quantile connectedness measures

Author: Lin, Xudong, Meng, Yiqun, and Zhu, Hao
Published: 2024
Full Text: View/download PDF

24. MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

Author: Reddy, Revanth Gangi, Rui, Xilin, Li, Manling, Lin, Xudong, Wen, Haoyang, Cho, Jaemin, Huang, Lifu, Bansal, Mohit, Sil, Avirup, Chang, Shih-Fu, Schwing, Alexander, and Ji, Heng
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance hence leaving large room for future work on this challenging new task., Comment: Accepted at AAAI 2022
Published: 2021

25. Video-Text Pre-training with Learned Regions

Author: Yan, Rui, Shou, Mike Zheng, Ge, Yixiao, Wang, Alex Jinpeng, Lin, Xudong, Cai, Guanyu, and Tang, Jinhui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatio-temporal structure of objects in video, which yet has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video-text representation learning, namely RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes visual features into semantic clusters, then (2) generates learnable masks and uses them to aggregate the features belonging to the same semantic region, and finally (3) models the interactions between different aggregated regions. In contrast to using off-the-shelf object detectors, our proposed module does not require explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our RegionLearner. The code will be available at https://github.com/ruiyan1995/Region_Learner.
Published: 2021

26. Object-aware Video-language Pre-training for Retrieval

Author: Wang, Alex Jinpeng, Ge, Yixiao, Cai, Guanyu, Yan, Rui, Lin, Xudong, Shan, Ying, Qie, Xiaohu, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recently, with the introduction of large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet, existing video-language transformer models do not explicitly model fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations. The key idea is to leverage the bounding boxes and object tags to guide the training process. We evaluate our model on three standard sub-tasks of video-text matching on four widely used benchmarks. We also provide deep analysis and detailed ablation of the proposed method. We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture. The code will be released at https://github.com/FingerRec/OA-Transformer., Comment: CVPR2022; Code: https://github.com/FingerRec/OA-Transformer
Published: 2021

27. Enlarge the dynamic range of Shack-Hartmann wavefront sensor based on biplanar image acquisition and segmentation method

Author: Ye, Shaowei, Li, Ming, Zhou, Lixiang, Zhu, Tianlin, Li, Xin, Han, Xida, Wu, Xianglin, and Lin, Xudong
Published: 2024
Full Text: View/download PDF

28. Joint Multimedia Event Extraction from Video and Article

Author: Chen, Brian, Lin, Xudong, Thomas, Christopher, Li, Manling, Yoshida, Shoya, Chum, Lovish, Ji, Heng, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from video and text articles. We introduce the new task of Video MultiMedia Event Extraction (Video M2E2) and propose two novel components to build the first system towards this task. First, we propose the first self-supervised multimodal event coreference model that can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents. We also construct and will publicly release a new benchmark of video-article pairs, consisting of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of our proposed method on our new benchmark dataset. We achieve 6.0% and 5.8% absolute F-score gain on multimodal event coreference resolution and multimedia event extraction., Comment: To be presented at EMNLP 2021 findings
Published: 2021

29. The preparation of synbiotic AHY relieving loperamide-induced constipation and its modulation mechanism in vivo

Author: Jiang, Lai, Zhang, Rui, Lin, Xudong, Tuo, Yanfeng, Mu, Guangqing, and Jiang, Shujuan
Published: 2024
Full Text: View/download PDF

30. Efficiency testing method for the echo receiving system of laser ranging station

Author: Zhou, Lixiang, Han, Xida, Ye, Shaowei, Lin, Xudong, Zhao, Hongchao, Zhu, Tianlin, and Li, Ming
Published: 2024
Full Text: View/download PDF

31. Advancements in life-on-a-chip: The impact of “Beyond Limits Manufacturing” technology

Author: He, Weiwei, Zhang, Hongbo, Lin, Xudong, Zhu, Lili, Zheng, Tingting, Pei, Hao, Tian, Yang, Zhang, Min, Shi, Guoyue, Wu, Lei, Zhao, Jianlong, Wumaier, Gulinuer, Li, Shengqing, Xu, Yufang, Li, Honglin, and Qian, Xuhong
Published: 2024
Full Text: View/download PDF

32. Gate appointment design in a container terminal: A robust optimization approach

Author: Li, Shuqin, Jia, Shuai, Tao, Yi, and Lin, Xudong
Published: 2024
Full Text: View/download PDF

33. A Steady-State Model on Finger-Vein Recognition Accuracy

Author: Liu, Shilei, Li, Qin, Yang, Geng, Lin, Xudong, Zheng, Zhenqi, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Nakamatsu, Kazumi, editor, Kountchev, Roumen, editor, Patnaik, Srikanta, editor, and Abe, Jair M., editor
Published: 2023
Full Text: View/download PDF

34. A comprehensive loss analysis-based decision support method for e-democratic multi-agent cooperative decision-making

Author: Du, Zhijiao, Yu, Sumin, Wang, Jing, Luo, Hanyang, and Lin, Xudong
Published: 2024
Full Text: View/download PDF

35. Co-selection mechanism for bacterial resistance to major chemical pollutants in the environment

Author: Huo, Meixia, Xu, Xiangyue, Mi, Kun, Ma, Wenjin, Zhou, Qin, Lin, Xudong, Cheng, Guyue, and Huang, Lingli
Published: 2024
Full Text: View/download PDF

36. Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

Author: Song, Sijie, Lin, Xudong, Liu, Jiaying, Guo, Zongming, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expression and scene dynamics. Unlike previous methods which solve the problem in multiple stages (i.e., tracking, proposal-based matching), we tackle the problem from a novel perspective, co-grounding, with an elegant one-stage framework. We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency with co-grounding feature learning. Semantic attention learning explicitly parses referring cues in different attributes to reduce the ambiguity in the complex expression. Co-grounding feature learning boosts visual feature representations by integrating temporal correlation to reduce the ambiguity caused by scene dynamics. Experiment results demonstrate the superiority of our framework on the video grounding datasets VID and LiOTB in generating accurate and stable results across frames. Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset. Our project is available at https://sijiesong.github.io/co-grounding., Comment: Accepted to CVPR2021. The project page is at https://sijiesong.github.io/co-grounding
Published: 2021

37. VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Author: Lin, Xudong, Bertasius, Gedas, Wang, Jue, Chang, Shih-Fu, Parikh, Devi, and Torresani, Lorenzo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different "video+x to text" problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks -- captioning, question answering and audio-visual scene-aware dialog., Comment: Work in progress
Published: 2021

38. Processing method of laser ranging data based on array of superconducting nanowire single photon detector

Author: Li, Hui, Zhu, Tianlin, Lin, Xudong, Zhou, Chengkai, Wang, Peng, Feng, Jiali, Wang, Jinhao, Wang, Xuan, Wu, Xianlin, Han, Xida, and Li, Ming
Published: 2025
Full Text: View/download PDF

39. Using superconducting single photon detector to improve the coupling efficiency of multimode fibers in the telescope ranging system

Author: Yuan, Chunyu, Ye, Shaowei, Zhou, Chengkai, Li, Ming, Lin, Xudong, and Han, Xida
Published: 2025
Full Text: View/download PDF

40. Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

Author: Huang, Shiyuan, Lin, Xudong, Karaman, Svebor, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Two-stream networks have achieved great success in video recognition. A two-stream network combines a spatial stream of RGB frames and a temporal stream of optical flow to make predictions. However, the temporal redundancy of RGB frames as well as the high cost of optical flow computation create challenges for both performance and efficiency. Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitude. Previous works create one stream for each modality, and these streams are combined with an additional temporal stream through late fusion. This is redundant since some modalities like motion vectors already contain temporal information. Based on this observation, we propose a compressed domain two-stream network IP TSN for compressed video recognition, where the two streams are represented by the two types of frames (I and P frames) in compressed videos, without needing a separate temporal stream. With this goal, we propose to fully exploit the motion information of the P-stream through generalized distillation from optical flow, which largely improves the efficiency and accuracy. Our P-stream runs 60 times faster than using optical flow while achieving higher accuracy. Our full IP TSN, evaluated over public action recognition benchmarks (UCF101, HMDB51 and a subset of Kinetics), outperforms other compressed domain methods by large margins while improving the total inference speed by 20%.
Published: 2019

41. Towards Train-Test Consistency for Semi-supervised Temporal Action Localization

Author: Lin, Xudong, Shou, Zheng, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, Weakly-supervised Temporal Action Localization (WTAL) has been densely studied, but there is still a large gap between weakly-supervised models and fully-supervised models. It is practical and intuitive to annotate temporal boundaries of a few examples and utilize them to help WTAL models better detect actions. However, the train-test discrepancy of the action localization strategy prevents WTAL models from leveraging semi-supervision for further improvement. At training time, attention or multiple instance learning is used to aggregate predictions of each snippet for video-level classification; at test time, they first obtain action score sequences over time, then truncate segments of scores higher than a fixed threshold, and post-process action segments. The inconsistent strategy makes it hard to explicitly supervise the action localization model with temporal boundary annotations at training time. In this paper, we propose a Train-Test Consistent framework, TTC-Loc. At both training and test time, our TTC-Loc localizes actions by comparing scores of action classes with a predicted threshold, which enables it to be trained with semi-supervision. By fixing the train-test discrepancy, our TTC-Loc significantly outperforms the state-of-the-art performance on THUMOS'14, ActivityNet 1.2 and 1.3 when only video-level labels are provided for training. With full annotations of only one video per class and video-level labels for the other videos, our TTC-Loc further boosts the performance and achieves 33.4% mAP (IoU threshold 0.5) on THUMOS'14., Comment: Work in progress
Published: 2019

42. Context-Gated Convolution

Author: Lin, Xudong, Ma, Lin, Liu, Wei, and Chang, Shih-Fu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: As the basic building block of Convolutional Neural Networks (CNNs), the convolutional layer is designed to extract local patterns and by its nature lacks the ability to model global context. Many efforts have recently been devoted to complementing CNNs with global modeling ability, especially by a family of works on global feature interaction. In these works, the global context information is incorporated into local features before they are fed into convolutional layers. However, research in neuroscience reveals that neurons' ability to modify their functions dynamically according to context is essential for perceptual tasks, which has been overlooked in most CNNs. Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features. Moreover, our proposed CGC is lightweight and applicable to modern CNN architectures, and consistently improves the performance of CNNs according to extensive experiments on image classification, action recognition, and machine translation. The code for this paper is available at https://github.com/XudongLinthu/context-gated-convolution., Comment: ECCV 2020 camera ready version with appendix
Published: 2019

43. ESG greenwashing and equity mispricing: Evidence from China

Author: Lin, Xudong, Zhu, Hao, and Meng, Yiqun
Published: 2023
Full Text: View/download PDF

44. China’s current forest age structure will lead to weakened carbon sinks in the near future

Author: Shang, Rong, Chen, Jing M., Xu, Mingzhu, Lin, Xudong, Li, Peng, Yu, Guirui, He, Nianpeng, Xu, Li, Gong, Peng, Liu, Liangyun, Liu, Han, and Jiao, Wenzhe
Published: 2023
Full Text: View/download PDF

45. How connected is the crypto market risk to investor sentiment?

Author: Lin, Xudong, Meng, Yiqun, and Zhu, Hao
Published: 2023
Full Text: View/download PDF

46. High-resolution forest age mapping based on forest height maps derived from GEDI and ICESat-2 space-borne lidar data

Author: Lin, Xudong, Shang, Rong, Chen, Jing M., Zhao, Guoshuai, Zhang, Xiaoping, Huang, Yiping, Yu, Guirui, He, Nianpeng, Xu, Li, and Jiao, Wenzhe
Published: 2023
Full Text: View/download PDF

47. A multi-agent decision approach for optimal energy allocation in microgrid system

Author: Huang, Mengxing, Lin, Xudong, Feng, Zikai, Wu, Di, and Shi, Zhiyi
Published: 2023
Full Text: View/download PDF

48. Four different Lactiplantibacillus plantarum strains relieve loperamide-induced constipation in BALB/c mice by regulation of gut microbiota and metabolites

Author: Zhang, Rui, Lin, Xudong, Song, Ying, Tuo, Yanfeng, Mu, Guangqing, and Jiang, Shujuan
Published: 2023
Full Text: View/download PDF

49. A Steady-State Model on Finger-Vein Recognition Accuracy

Author: Liu, Shilei, primary, Li, Qin, additional, Yang, Geng, additional, Lin, Xudong, additional, and Zheng, Zhenqi, additional
Published: 2023
Full Text: View/download PDF

50. What Caused the Inconsistency between Rb−Sr and 40Ar−39Ar Ages of Authigenic Illites?

Author: Liu, Entao, Uysal, I. Tonguç, Zhao, Jian-Xin, Zhang, Zi’ao, and Lin, Xudong
Published: 2022
Full Text: View/download PDF
