26,063 results for "Zhou, Jie"
Search Results
2. XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
- Author
-
Wang, Ziyi, Wang, Yanbo, Yu, Xumin, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D., Comment: Accepted to NeurIPS 2024
- Published
- 2024
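As a toy illustration of what mask-level alignment can mean in practice, the sketch below average-pools per-point features under a binary 2D mask to obtain a single mask-level embedding that could be compared against text features. The function name and the plain-list feature representation are invented for illustration; this is not the paper's implementation.

```python
def mask_pooled_features(point_feats, mask):
    """Average the features of points falling inside a binary mask,
    yielding one mask-level embedding (illustrative sketch only)."""
    selected = [f for f, m in zip(point_feats, mask) if m]
    dim = len(point_feats[0])
    if not selected:  # empty mask: return a zero embedding
        return [0.0] * dim
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]
```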
3. Wafer-scale Semiconductor Grafting: Enabling High-Performance, Lattice-Mismatched Heterojunctions
- Author
-
Zhou, Jie, Zhang, Qiming, Gong, Jiarui, Lu, Yi, Liu, Yang, Abbasi, Haris, Qiu, Haining, Kim, Jisoo, Lin, Wei, Kim, Donghyeok, Li, Yiran, Ng, Tien Khee, Jang, Hokyung, Liu, Dong, Wang, Haiyan, Ooi, Boon S., and Ma, Zhenqiang
- Subjects
Physics - Applied Physics, Condensed Matter - Materials Science - Abstract
Semiconductor heterojunctions are foundational to many advanced electronic and optoelectronic devices. However, achieving high-quality, lattice-mismatched interfaces remains challenging, limiting both scalability and device performance. Semiconductor grafting offers a promising solution by directly forming electrically active, lattice-mismatched heterojunctions between dissimilar materials. However, its scalability and uniformity at the wafer level have yet to be demonstrated. This work demonstrates the achievement of highly uniform, reproducible results across silicon, sapphire, and gallium nitride (GaN) substrates using wafer-scale semiconductor grafting. To illustrate this scalability, we conducted an in-depth study of a grafted Si/GaN heterojunction, examining band alignment through X-ray photoelectron spectroscopy and confirming crystallinity and interfacial integrity with scanning transmission electron microscopy. The resulting p-n diodes exhibit significantly enhanced electrical performance and wafer-scale uniformity compared to conventional approaches. This work establishes wafer-scale semiconductor grafting as a versatile and scalable technology, bridging the gap between laboratory-scale research and industrial manufacturing for heterogeneous semiconductor integration, and paving the way for novel, high-performance electronic and optoelectronic devices., Comment: 23 pages, 6 figures
- Published
- 2024
4. MaDiNet: Mamba Diffusion Network for SAR Target Detection
- Author
-
Zhou, Jie, Xiao, Chao, Peng, Bowen, Liu, Tianpeng, Liu, Zhen, Liu, Yongxiang, and Liu, Li
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing - Abstract
The fundamental challenge in SAR target detection lies in developing discriminative, efficient, and robust representations of target characteristics within intricate non-cooperative environments. However, accurate target detection is impeded by factors including the sparse distribution and discrete features of the targets, as well as complex background interference. In this study, we propose a Mamba Diffusion Network (MaDiNet) for SAR target detection. Specifically, MaDiNet conceptualizes SAR target detection as the task of generating the position (center coordinates) and size (width and height) of the bounding boxes in the image space. Furthermore, we design a MambaSAR module to capture intricate spatial structural information of targets and enhance the capability of the model to differentiate between targets and complex backgrounds. Experimental results on extensive SAR target detection datasets show that MaDiNet achieves state-of-the-art performance, demonstrating the effectiveness of the proposed network. Code is available at https://github.com/JoyeZLearning/MaDiNet.
- Published
- 2024
5. 3-circle Theorem for Willmore surfaces II--degeneration of the complex structure
- Author
-
Li, Yuxiang, Yin, Hao, and Zhou, Jie
- Subjects
Mathematics - Differential Geometry - Abstract
We study the compactness of Willmore surfaces without assuming the convergence of the induced complex structures. In particular, we compute the energy loss in the neck in terms of the residue and we prove that the limit of the image of the Gauss map is a geodesic in the Grassmannian $G(2,n)$ whose length can also be computed in terms of the residue. Moreover, we provide a family of explicit Willmore surfaces in $\mathbb{R}^3$ that illustrate the degeneration phenomenon involved in the above results., Comment: 44 pages
- Published
- 2024
6. Extralonger: Toward a Unified Perspective of Spatial-Temporal Factors for Extra-Long-Term Traffic Forecasting
- Author
-
Zhang, Zhiwei, E, Shaojun, Meng, Fandong, Zhou, Jie, and Han, Wenjuan
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence - Abstract
Traffic forecasting plays a key role in Intelligent Transportation Systems, and significant strides have been made in this field. However, most existing methods can only predict up to four hours into the future, which falls short of real-world demands. We identify that the prediction horizon is limited to a few hours mainly due to the separation of temporal and spatial factors, which results in high complexity. Drawing inspiration from Albert Einstein's relativity theory, which suggests space and time are unified and inseparable, we introduce Extralonger, which unifies temporal and spatial factors. Extralonger notably extends the prediction horizon to a week on real-world benchmarks, demonstrating superior efficiency in training time, inference time, and memory usage. It sets new standards in long-term and extra-long-term scenarios. The code is available at https://github.com/PlanckChang/Extralonger., Comment: Accepted by NeurIPS 2024 workshop
- Published
- 2024
7. CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models
- Author
-
Chen, Meiqi, Meng, Fandong, Zhang, Yingxue, Zhang, Yan, and Zhou, Jie
- Subjects
Computer Science - Computation and Language - Abstract
Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain-specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval-Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality-enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.
- Published
- 2024
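The agent chain described in the CRAT abstract can be pictured as a simple function composition; the sketch below wires stub callables in the order the abstract names them (term identification → KG construction → judging → translation). Every name and signature here is hypothetical, purely to illustrate the data flow.

```python
def crat_pipeline(source, context, agents):
    """Illustrative chain of specialized agents, each a callable stub:
    identify unknown terms, build knowledge, validate it, then translate."""
    terms = agents["identify"](source, context)      # Unknown Terms Identification
    knowledge = agents["kg"](terms)                  # KG Constructor
    validated = agents["judge"](knowledge)           # Causality-enhanced Judge
    return agents["translate"](source, validated)    # Translator
```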
8. MiniPLM: Knowledge Distillation for Pre-Training Language Models
- Author
-
Gu, Yuxian, Zhou, Hao, Meng, Fandong, Zhou, Jie, and Huang, Minlie
- Subjects
Computer Science - Computation and Language - Abstract
Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces challenges in efficiency, flexibility, and effectiveness. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. To address these issues, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher's knowledge. For efficiency, MiniPLM performs offline teacher LM inference, allowing KD for multiple student LMs without adding training-time costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the difficulty and diversity of the training data, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 widely used downstream tasks, improves the language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to large pre-training scales, evidenced by the extrapolation of the scaling curves. Further analysis reveals that MiniPLM supports KD across model families and enhances the utilization of pre-training data. Our model, code, and data are available at https://github.com/thu-coai/MiniPLM.
- Published
- 2024
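One way to picture "refining the training data distribution with the teacher's knowledge" is difference sampling: score corpus samples by the gap between a large teacher LM's and a small reference LM's log-likelihoods, then keep the top fraction. The sketch below only conveys the flavor of such selection under that assumption; it is not MiniPLM's actual algorithm, and all names are illustrative.

```python
def select_by_ratio(samples, logp_teacher, logp_ref, keep_frac=0.5):
    """Keep samples the teacher LM likes relatively more than a small
    reference LM does (hypothetical difference-sampling sketch)."""
    scored = sorted(samples,
                    key=lambda s: logp_teacher[s] - logp_ref[s],
                    reverse=True)
    k = max(1, int(len(scored) * keep_frac))
    return scored[:k]
```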
9. V2M: Visual 2-Dimensional Mamba for Image Representation Learning
- Author
-
Wang, Chengkun, Zheng, Wenzhao, Huang, Yuanhui, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2-Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2-dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on the 2-dimensional SSM formulation and incorporate Mamba to achieve hardware-efficient parallel processing. The proposed V2M effectively incorporates the 2D locality prior yet inherits the efficiency and input-dependent scalability of Mamba. Extensive experimental results on ImageNet classification and downstream visual tasks including object detection and instance segmentation on COCO and semantic segmentation on ADE20K demonstrate the effectiveness of our V2M compared with other visual backbones.
- Published
- 2024
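The core idea of V2M's 2D formulation, generating the next state from the two adjacent states along both dimensions, can be sketched with a scalar recurrence. This is a drastically simplified, hand-rolled illustration (the scalar coefficients `a_row`, `a_col`, `b` are assumptions), not the paper's SSM parameterization or its parallel hardware-efficient form.

```python
def ssm_2d(x, a_row=0.5, a_col=0.5, b=1.0):
    """Scalar sketch of a 2D state-space recurrence: each state depends on
    the state above and the state to the left, plus the local input."""
    rows, cols = len(x), len(x[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = a_row * up + a_col * left + b * x[i][j]
    return h
```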
10. GlobalMamba: Global Image Serialization for Vision Mamba
- Author
-
Wang, Chengkun, Zheng, Wenzhao, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Vision mambas have demonstrated strong performance with linear complexity in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, ignoring the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT) and then arrange the pixels according to their frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
- Published
- 2024
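The frequency-band grouping step can be sketched by partitioning 2D DCT coefficient indices (u, v) into bands by u + v, lowest frequencies first, which yields the low-to-high serialization order the abstract describes. The exact banding rule below is an assumption for illustration, not the paper's scheme.

```python
def frequency_bands(n, num_bands):
    """Group 2D DCT coefficient indices (u, v) of an n x n grid into
    num_bands bands by u + v, low-frequency bands first (sketch)."""
    bands = [[] for _ in range(num_bands)]
    max_sum = 2 * (n - 1)
    for u in range(n):
        for v in range(n):
            b = min((u + v) * num_bands // (max_sum + 1), num_bands - 1)
            bands[b].append((u, v))
    return bands
```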
11. On the token distance modeling ability of higher RoPE attention dimension
- Author
-
Hong, Xiangyu, Jiang, Che, Qi, Biqing, Meng, Fandong, Yu, Mo, Zhou, Bowen, and Zhou, Jie
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual information remains elusive. Based on the intuition that different dimensions correspond to different frequencies of change in the RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention head, which we name Positional Heads, from various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidenced by our ablation studies. We further demonstrate the correlation between the efficiency of length extrapolation and the extension of the high-dimensional attention allocation of these heads. The identification of Positional Heads provides insights for future research in long-text comprehension., Comment: Accepted to EMNLP 2024 Findings
- Published
- 2024
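The intuition that different RoPE dimensions change at different frequencies follows directly from the standard RoPE frequency schedule, which the snippet below computes: higher dimension pairs rotate more slowly (longer wavelengths), so they are the ones that can track long token distances.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Per-dimension-pair rotation frequencies of standard RoPE:
    theta_i = base^(-2i / head_dim) for i = 0 .. head_dim/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

freqs = rope_frequencies(64)
# Wavelength (in tokens) of each pair: low pairs cycle quickly,
# high pairs cycle slowly and thus encode long-range positions.
wavelengths = [2 * math.pi / f for f in freqs]
```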
12. SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
- Author
-
Yin, Hang, Xu, Xiuwei, Wu, Zhenyu, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics - Abstract
In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help the LLM reason about the goal location according to scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism to empower the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark., Comment: Accepted to NeurIPS 2024. Project page: https://bagh2178.github.io/SG-Nav/
- Published
- 2024
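A hierarchical scene-graph prompt could be serialized along the rooms → groups → objects hierarchy the abstract mentions. The sketch below is an invented serialization purely to make the structure concrete; it is not the paper's actual prompt format.

```python
def build_prompt(scene_graph, goal):
    """Serialize a nested room -> group -> objects dict into a
    hierarchical text prompt for an LLM (format is hypothetical)."""
    lines = [f"Goal: find the {goal}. Observed scene:"]
    for room, groups in scene_graph.items():
        lines.append(f"Room {room}:")
        for group, objects in groups.items():
            lines.append(f"  Group {group}: " + ", ".join(objects))
    return "\n".join(lines)
```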
13. DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
- Author
-
Wang, Yutong, Zeng, Jiali, Liu, Xuebo, Wong, Derek F., Meng, Fandong, Zhou, Jie, and Zhang, Min
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at https://github.com/YutongWang1216/DocMTAgent.
- Published
- 2024
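The multi-level memory DelTA describes could be pictured as a small container holding proper-noun records plus long- and short-term sentence memory. The class below is a structural illustration only; all attribute names and the consistency rule (keep the first translation of each proper noun) are assumptions, not the paper's implementation.

```python
class TranslationMemory:
    """Sketch of a multi-level memory for document-level translation."""

    def __init__(self, short_term_size=5):
        self.proper_nouns = {}        # source term -> chosen translation
        self.long_term = []           # all (source, target) sentence pairs
        self.short_term_size = short_term_size

    def record(self, src, tgt, nouns=None):
        self.long_term.append((src, tgt))
        for term, translation in (nouns or {}).items():
            # Keep the first choice so later sentences stay consistent.
            self.proper_nouns.setdefault(term, translation)

    @property
    def short_term(self):
        return self.long_term[-self.short_term_size:]
```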
14. Q-VLM: Post-training Quantization for Large Vision-Language Models
- Author
-
Wang, Changyuan, Wang, Ziwei, Xu, Xiuwei, Tang, Yansong, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to acquire an optimal quantization strategy without considering cross-layer dependency. On the contrary, we mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy searching with low search cost. Specifically, we observe the strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ the entropy as the proxy to partition blocks optimally, which aims to achieve satisfactory trade-offs between discretization errors and the search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of search space, so that the search cost is further reduced without harming the quantization accuracy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.
- Published
- 2024
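Using activation entropy as a proxy statistic can be illustrated with a plain histogram-entropy computation over a list of activation values; the fixed-width binning below is an assumption for illustration, not the paper's estimator.

```python
import math

def activation_entropy(acts, num_bins=8):
    """Shannon entropy (bits) of a fixed-width histogram of activations --
    the kind of proxy one could use to decide block partitions (sketch)."""
    lo, hi = min(acts), max(acts)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant input
    counts = [0] * num_bins
    for a in acts:
        idx = min(int((a - lo) / width), num_bins - 1)
        counts[idx] += 1
    total = len(acts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)
```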
15. Engineering the Nonlinearity of Bosonic Modes with a Multi-loop SQUID
- Author
-
Hua, Ziyue, Xu, Yifang, Wang, Weiting, Ma, Yuwei, Zhou, Jie, Cai, Weizhou, Ai, Hao, Liu, Yu-xi, Li, Ming, Zou, Chang-Ling, and Sun, Luyan
- Subjects
Quantum Physics - Abstract
Engineering high-order nonlinearities while suppressing lower-order terms is crucial for quantum error correction and state control in bosonic systems, yet it remains an outstanding challenge. Here, we introduce a general framework of Nonlinearity-Engineered Multi-loop SQUID (NEMS) device, enabling the realization of arbitrary nonlinearities by tuning fluxes in multiple loops within superconducting circuits. We demonstrate specific examples of NEMS devices that selectively engineer pure cubic, quartic, and quintic interactions with suppressed parasitic couplings, showing great promise for realizing Kerr-cat bias-preserving CNOT gates and stabilizing four-leg cat qubits. By opening new avenues for tailoring nonlinear Hamiltonians of superconducting devices, this work enables sophisticated and precise manipulation of bosonic modes, with potential applications in quantum computation, simulation, and sensing., Comment: 26 pages, 13 figures
- Published
- 2024
16. Observing tight triple uncertainty relations in two-qubit systems
- Author
-
Wang, Yan, Zhou, Jie, Fan, Xing-Yan, Hao, Ze-Yan, Li, Jia-Kun, Liu, Zheng-Hao, Sun, Kai, Xu, Jin-Shi, Chen, Jing-Ling, Li, Chuan-Feng, and Guo, Guang-Can
- Subjects
Quantum Physics - Abstract
As the fundamental tool in quantum information science, the uncertainty principle is essential for manifesting nonclassical properties of quantum systems. Plenty of efforts on the uncertainty principle with two observables have been achieved, making it an appealing challenge to extend the scenario to multiple observables. Here, based on an optical setup, we demonstrate the uncertainty relations in two-qubit systems involving three physical components with the tight constant $2/\sqrt{3}$, which signifies a more precise limit in the measurement of multiple quantum components and offers deeper insights into the trade-offs between observables. Furthermore, we reveal the correspondence between the maximal values of the uncertainty functions and the degree of entanglement, where greater uncertainty corresponds to a higher degree of entanglement. Our results provide new insight into understanding uncertainty relations with multiple observables and may motivate more innovative applications in quantum information science., Comment: 13 pages, 16 figures
- Published
- 2024
17. DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
- Author
-
Zhao, Ranchi, Thai, Zhen Leng, Zhang, Yifan, Hu, Shengding, Ba, Yunqi, Zhou, Jie, Cai, Jie, Liu, Zhiyuan, and Sun, Maosong
- Subjects
Computer Science - Computation and Language - Abstract
The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for DecorateLM using a large language model and distill data engineering expertise into a compact 1.2 billion parameter small language model (SLM). We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM. Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
- Published
- 2024
18. Exploring the Benefit of Activation Sparsity in Pre-training
- Author
-
Zhang, Zhengyan, Xiao, Chaojun, Qin, Qiujieli, Lin, Yankai, Zeng, Zhiyuan, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, and Zhou, Jie
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between Mixtures-of-Experts (MoE) based sparse training and conventional dense training during the pre-training process, exploiting the efficiency of sparse training while avoiding the static activation correlation that purely sparse training would impose. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to $2\times$ faster inference speed. Code is available at https://github.com/thunlp/moefication., Comment: ICML 2024
- Published
- 2024
19. OPONeRF: One-Point-One NeRF for Robust Neural Rendering
- Author
-
Zheng, Yu, Duan, Yueqi, Zheng, Kangfu, Yan, Hongru, Lu, Jiwen, and Zhou, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we propose a One-Point-One NeRF (OPONeRF) framework for robust scene rendering. Existing NeRFs are designed based on a key assumption that the target scene remains unchanged between the training and test time. However, small but unpredictable perturbations such as object movements, light changes and data contaminations broadly exist in real-life 3D scenes, which lead to significantly defective or failed rendering results even for the recent state-of-the-art generalizable methods. To address this, we propose a divide-and-conquer framework in OPONeRF that adaptively responds to local scene variations via personalizing appropriate point-wise parameters, instead of fitting a single set of NeRF parameters that cannot respond to unseen test-time changes. Moreover, to explicitly capture the local uncertainty, we decompose the point representation into deterministic mapping and probabilistic inference. In this way, OPONeRF learns the sharable invariance and models, in an unsupervised manner, the unexpected scene variations between the training and testing scenes. To validate the effectiveness of the proposed method, we construct benchmarks from both realistic and synthetic data with diverse test-time perturbations including foreground motions, illumination variations and multi-modality noises, which are more challenging than conventional generalization and temporal reconstruction benchmarks. Experimental results show that our OPONeRF outperforms state-of-the-art NeRFs on various evaluation metrics through benchmark experiments and cross-scene evaluations. We further show the efficacy of the proposed method via experimenting on other existing generalization-based benchmarks and incorporating the idea of One-Point-One NeRF into other advanced baseline methods., Comment: Project page and dataset: https://yzheng97.github.io/OPONeRF/
- Published
- 2024
20. MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation
- Author
-
Chen, Wenchao, Niu, Liqiang, Lu, Ziyao, Meng, Fandong, and Zhou, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Image generation models have encountered challenges related to scalability and quadratic complexity, primarily due to the reliance on Transformer-based backbones. In this study, we introduce MaskMamba, a novel hybrid model that combines Mamba and Transformer architectures, utilizing Masked Image Modeling for non-autoregressive image synthesis. We meticulously redesign the bidirectional Mamba architecture by implementing two key modifications: (1) replacing causal convolutions with standard convolutions to better capture global context, and (2) utilizing concatenation instead of multiplication, which significantly boosts performance while accelerating inference speed. Additionally, we explore various hybrid schemes of MaskMamba, including both serial and grouped parallel arrangements. Furthermore, we incorporate an in-context condition that allows our model to perform both class-to-image and text-to-image generation tasks. Our MaskMamba outperforms Mamba-based and Transformer-based models in generation quality. Notably, it achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
- Published
- 2024
21. A Survey on the Honesty of Large Language Models
- Author
-
Li, Siheng, Yang, Cheng, Wu, Taiqiang, Shi, Chufan, Zhang, Yuji, Zhu, Xinyu, Cheng, Zesen, Cai, Deng, Yu, Mo, Liu, Lemao, Zhou, Jie, Yang, Yujiu, Wong, Ngai, Wu, Xixin, and Lam, Wai
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and be able to faithfully express their knowledge. Despite their promise, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs also faces challenges, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area., Comment: Project Page: https://github.com/SihengLi99/LLM-Honesty-Survey
- Published
- 2024
22. FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner
- Author
-
Zhao, Wenliang, Shi, Minglei, Yu, Xumin, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in the flow-based models will become stable during the sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied for various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%$\sim$58.3% on class-conditional generation and 29.8%$\sim$38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms/img and an FID of 3.93 at 38 ms/img, achieving real-time image generation and establishing a new state of the art. Code is available at https://github.com/shiml20/FlowTurbo., Comment: Accepted to NeurIPS 2024
- Published
- 2024
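The core observation above (the heavy velocity predictor's outputs stabilize during sampling, so later steps can rely on a cheap refiner) can be sketched numerically. Everything below is a toy illustration, not the paper's implementation: `heavy_velocity` stands in for the full network (here a linear field with a known solution), and `light_refiner` stands in for the learned lightweight refiner.

```python
import numpy as np

def heavy_velocity(x, t):
    # Stand-in for the expensive velocity predictor: v(x, t) = -x.
    return -x

def light_refiner(x, t, v_prev):
    # Hypothetical refiner: nudge the cached velocity toward the current
    # heavy prediction (stands in for a small learned correction network).
    return v_prev + 0.5 * (heavy_velocity(x, t) - v_prev)

def sample(x0, n_steps=10, n_heavy=3):
    """Euler sampling: full predictor for the first n_heavy steps,
    then the lightweight refiner reuses the cached velocity."""
    x, dt, v = x0, 1.0 / n_steps, None
    for k in range(n_steps):
        t = k * dt
        if k < n_heavy:
            v = heavy_velocity(x, t)    # expensive call
        else:
            v = light_refiner(x, t, v)  # cheap correction
        x = x + dt * v
    return x

x_fast = sample(np.array([1.0]), n_heavy=3)   # heavy model called 3 times
x_full = sample(np.array([1.0]), n_heavy=10)  # pure heavy-model Euler
```

With this toy field, the refined trajectory stays close to the full-predictor one while invoking the heavy model only three times, which is the spirit of the speedup claimed above.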
23. Single-crystalline GaAs/Si Heterojunction Tunnel Diodes Interfaced by an Ultrathin Oxygen-enriched Layer
- Author
-
Zhou, Jie, Wang, Yifan, Yao, Ziqian, Wang, Qingxiao, Banda, Yara S., Gong, Jiarui, Liu, Yang, Adamo, Carolina, Marshall, Patrick, Lu, Yi, Tsai, Tsung-Han, Li, Yiran, Gambin, Vincent, Ng, Tien Khee, Ooi, Boon S., and Ma, Zhenqiang
- Subjects
Physics - Applied Physics ,Condensed Matter - Mesoscale and Nanoscale Physics - Abstract
We report the fabrication and characteristics of GaAs/Si p+/n+ heterojunction tunnel diodes. These diodes were fabricated via grafting the freestanding single-crystalline p-type degenerately doped GaAs (4E19 cm-3) nanomembrane (NM) onto single-crystalline n-type Si (5E19 cm-3) substrate. At the heterointerface, an amorphous ultrathin oxygen-enriched layer (UOL) was intentionally engineered through chemical oxidation and atomic layer deposition (ALD). Scanning transmission electron microscopy (STEM) confirmed the formation of the UOL and the single crystallinity of the grafted junction. The resulting tunnel diodes consistently exhibited negative differential resistance (NDR) behavior at room temperature, with a high maximum peak-to-valley current ratio (PVCR) of 36.38, valley voltages ranging from 1.3 to 1.8 V, and a peak tunneling current density of 0.95 kA/cm2. This study not only highlights the critical roles of the UOL as both an interface improvement layer and a quantum tunneling medium, but also establishes "semiconductor grafting" as an effective and versatile method for high-performance, lattice-mismatched heterojunction devices., Comment: 4 pages, 5 figures
- Published
- 2024
24. Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog
- Author
-
Yao, Bingkun, Wang, Ning, Zhou, Jie, Wang, Xi, Gao, Hong, Jiang, Zhe, and Guan, Nan
- Subjects
Computer Science - Hardware Architecture ,Computer Science - Artificial Intelligence - Abstract
Bug localization in Verilog code is a crucial and time-consuming task during the verification of hardware designs. Since their introduction, large language models (LLMs) have shown strong programming capabilities. However, no work has yet considered using LLMs for bug localization in Verilog code. This paper presents Location-is-Key (LiK), an open-source LLM-based solution for locating functional errors in Verilog snippets. LiK achieves high localization accuracy, with a pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassing GPT-4's 77.9% and comparable to Claude-3.5's 90.8%. Additionally, the bug locations obtained by LiK significantly improve GPT-3.5's bug repair efficiency (functional pass@1 increased from 40.39% to 58.92%), highlighting the importance of bug localization in LLM-based Verilog debugging. Compared to existing methods, LiK requires only the design specification and the erroneous code snippet, without the need for testbenches, assertions, or any other EDA tools. This research demonstrates the feasibility of using LLMs for Verilog error localization, thus providing a new direction for automatic Verilog code debugging.
- Published
- 2024
25. AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
- Author
-
Lan, Zhibin, Niu, Liqiang, Meng, Fandong, Li, Wenbo, Zhou, Jie, and Su, Jinsong
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
When dealing with high-resolution images, dominant LMMs usually divide them into multiple local images and one global image, which leads to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. This approach not only reduces the number of visual tokens and speeds up inference, but also improves the overall model performance. Specifically, we introduce the following modules based on LLaVA-NeXT: (a) a visual granularity scaler that includes multiple pooling layers to obtain visual tokens with different granularities; (b) a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we propose RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53$\times$ increase in inference speed on the AI2D benchmark)., Comment: Preprint
- Published
- 2024
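The visual granularity scaler described above (multiple pooling layers yielding visual tokens at different granularities) can be sketched with plain average pooling. This is an illustrative sketch under assumed shapes, not the paper's implementation; the kernel sizes and the square feature map are assumptions.

```python
import numpy as np

def pool(feat, k):
    # Average-pool a (H, W, D) feature map with a k x k kernel
    # (assumes k divides both H and W).
    H, W, D = feat.shape
    return feat.reshape(H // k, k, W // k, k, D).mean(axis=(1, 3))

def granularity_pyramid(feat, kernels=(1, 2, 4)):
    """Return the visual-token count at each pooled granularity."""
    levels = [pool(feat, k) for k in kernels]
    return [lvl.shape[0] * lvl.shape[1] for lvl in levels]

# A 24x24 grid of 8-dim visual features yields 576 / 144 / 36 tokens.
tokens = granularity_pyramid(np.zeros((24, 24, 8)))
```

A router, as sketched in the abstract, would then pick one of these levels per input, trading token count against detail.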
26. Grafted AlGaAs/GeSn Optical Pumping Laser Operating up to 130 K
- Author
-
Zhou, Jie, Vincent, Daniel, Acharya, Sudip, Ojo, Solomon, Abrand, Alireza, Liu, Yang, Gong, Jiarui, Liu, Dong, Haessly, Samuel, Shen, Jianping, Xu, Shining, Li, Yiran, Lu, Yi, Stanchu, Hryhorii, Mawst, Luke, Claflin, Bruce, Mohseni, Parsian K., Ma, Zhenqiang, and Yu, Shui-Qing
- Subjects
Physics - Optics ,Condensed Matter - Materials Science - Abstract
Group IV GeSn double-heterostructure (DHS) lasers offer unique advantages of a direct bandgap and CMOS compatibility. However, further improvements in laser performance have been bottlenecked by limited junction properties of GeSn through conventional epitaxy and wafer bonding. This work leverages semiconductor grafting to synthesize and characterize optically pumped ridge edge-emitting lasers (EELs) with an AlGaAs nanomembrane (NM) transfer-printed onto an epitaxially grown GeSn substrate, interfaced by an ultrathin Al2O3 layer. The grafted AlGaAs/GeSn DHS lasers show a lasing threshold of 11.06 mW at 77 K and a maximum lasing temperature of 130 K. These results highlight the potential of the grafting technique for enhancing charge carrier and optical field confinements, paving the way for room-temperature electrically injected GeSn lasers., Comment: 5 pages, 5 figures. Supplementary Information included
- Published
- 2024
27. DrawingSpinUp: 3D Animation from Single Character Drawings
- Author
-
Zhou, Jie, Xiao, Chufeng, Lam, Miu-Ling, and Fu, Hongbo
- Subjects
Computer Science - Graphics - Abstract
Animating various character drawings is an engaging visual content creation task. Given a single character drawing, existing animation methods are limited to flat 2D motions and thus lack 3D effects. An alternative solution is to reconstruct a 3D model from a character drawing as a proxy and then retarget 3D motion data onto it. However, the existing image-to-3D methods could not work well for amateur character drawings in terms of appearance and geometry. We observe the contour lines, commonly existing in character drawings, would introduce significant ambiguity in texture synthesis due to their view-dependence. Additionally, thin regions represented by single-line contours are difficult to reconstruct (e.g., slim limbs of a stick figure) due to their delicate structures. To address these issues, we propose a novel system, DrawingSpinUp, to produce plausible 3D animations and breathe life into character drawings, allowing them to freely spin up, leap, and even perform a hip-hop dance. For appearance improvement, we adopt a removal-then-restoration strategy to first remove the view-dependent contour lines and then render them back after retargeting the reconstructed character. For geometry refinement, we develop a skeleton-based thinning deformation algorithm to refine the slim structures represented by the single-line contours. The experimental evaluations and a perceptual user study show that our proposed method outperforms the existing 2D and 3D animation methods and generates high-quality 3D animations from a single character drawing. Please refer to our project page (https://lordliang.github.io/DrawingSpinUp) for the code and generated animations., Comment: 10 pages, 15 figures
- Published
- 2024
28. POINTS: Improving Your Vision-language Model with Affordable Strategies
- Author
-
Liu, Yuan, Zhao, Zhongyin, Zhuang, Ziyuan, Tian, Le, Zhou, Xiao, and Zhou, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Multimedia - Abstract
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community., Comment: v2
- Published
- 2024
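The perplexity-based pre-training data filter mentioned above can be sketched as follows. The scoring model, field names, and keep fraction are assumptions for illustration; the paper's actual pipeline and thresholds are not specified here.

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(-mean log-probability per token).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_by_perplexity(docs, keep_frac=0.5):
    """Keep the fraction of documents with the lowest perplexity."""
    scored = sorted(docs, key=lambda d: perplexity(d["logprobs"]))
    return scored[: max(1, int(len(scored) * keep_frac))]

# Hypothetical corpus: per-token log-probs under some scoring LM.
corpus = [
    {"text": "clean", "logprobs": [-0.5, -0.4, -0.6]},
    {"text": "noisy", "logprobs": [-3.0, -2.5, -2.8]},
]
kept = filter_by_perplexity(corpus, keep_frac=0.5)
```

Lower perplexity marks text the scoring model finds predictable, which is the selection criterion the abstract describes for curating the 1M-sample set.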
29. DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
- Author
-
Zhao, Wenliang, Wang, Haolin, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024$\times$1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at https://github.com/wl-zhao/DC-Solver, Comment: Accepted by ECCV 2024
- Published
- 2024
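The dynamic compensation idea above, blending predictor and corrector updates with a ratio fitted against a ground-truth trajectory on a handful of datapoints, can be illustrated with scalars. This is a hypothetical, heavily simplified sketch: the three step functions are stand-ins, not the samplers from the paper, and the least-squares fit replaces the paper's optimization procedure.

```python
import numpy as np

def predictor_step(x):
    return 0.8 * x   # stand-in for the predictor update

def corrector_step(x):
    return 0.7 * x   # stand-in for the (misaligned) corrector update

def ground_truth_step(x):
    return 0.75 * x  # the trajectory the compensated step should match

# Fit the compensation ratio r on a few datapoints by least squares:
#   x_next = x_pred + r * (x_corr - x_pred)
xs = np.array([1.0, 2.0, 3.0])
p, c, g = predictor_step(xs), corrector_step(xs), ground_truth_step(xs)
r = np.dot(c - p, g - p) / np.dot(c - p, c - p)

def compensated_step(x):
    # Plug-and-play compensated update using the fitted ratio.
    return predictor_step(x) + r * (corrector_step(x) - predictor_step(x))
```

In this toy setup the fitted ratio lands the compensated step exactly on the ground-truth trajectory; in the paper the ratios vary per sampling step and are predicted for unseen configurations by a regression.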
30. CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
- Author
-
Liu, Wentao, Pan, Qianjun, Zhang, Yi, Liu, Zhuo, Wu, Ji, Zhou, Jie, Zhou, Aimin, Chen, Qin, Jiang, Bo, and He, Liang
- Subjects
Computer Science - Computation and Language - Abstract
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or options, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.
- Published
- 2024
31. DeTRAP: RISC-V Return Address Protection With Debug Triggers
- Author
-
Richter, Isaac, Zhou, Jie, and Criswell, John
- Subjects
Computer Science - Cryptography and Security - Abstract
Modern microcontroller software is often written in C/C++ and suffers from control-flow hijacking vulnerabilities. Previous mitigations suffer from high performance and memory overheads and require either the presence of memory protection hardware or sophisticated program analysis in the compiler. This paper presents DeTRAP (Debug Trigger Return Address Protection). DeTRAP utilizes a full implementation of the RISC-V debug hardware specification to provide a write-protected shadow stack for return addresses. Unlike previous work, DeTRAP requires no memory protection hardware and only minor changes to the compiler toolchain. We tested DeTRAP on an FPGA running a 32-bit RISC-V microcontroller core and found average execution time overheads to be between 0.5% and 1.9% on evaluated benchmark suites with code size overheads averaging 7.9% or less., Comment: To appear at IEEE Secure Development Conference 2024
- Published
- 2024
32. Characterization of AlGaAs/GeSn heterojunction band alignment via X-ray photoelectron spectroscopy
- Author
-
Liu, Yang, Gong, Jiarui, Acharya, Sudip, Li, Yiran, Abrand, Alireza, Rudie, Justin M., Zhou, Jie, Lu, Yi, Abbasi, Haris Naeem, Vincent, Daniel, Haessly, Samuel, Tsai, Tsung-Han, Mohseni, Parsian K., Yu, Shui-Qing, and Ma, Zhenqiang
- Subjects
Physics - Applied Physics ,Condensed Matter - Materials Science - Abstract
GeSn-based SWIR lasers, with applications in imaging, sensing, and communications, have developed rapidly in recent years. However, the existing SiGeSn/GeSn double heterostructure lacks adequate electron confinement and is insufficient for room-temperature lasing. The recently demonstrated semiconductor grafting technique provides a viable approach towards AlGaAs/GeSn p-i-n heterojunctions with better electron confinement and high-quality interfaces, promising for room-temperature electrically pumped GeSn laser devices. Therefore, understanding and quantitatively characterizing the band alignment in this grafted heterojunction is crucial. In this study, we explore the band alignment in the grafted monocrystalline Al0.3Ga0.7As/Ge0.853Sn0.147 p-i-n heterojunction. We determined the bandgap values of AlGaAs and GeSn to be 1.81 eV and 0.434 eV, respectively, by photoluminescence measurements. We further conducted X-ray photoelectron spectroscopy measurements and extracted a valence band offset of 0.19 eV and a conduction band offset of 1.186 eV. A Type-I band alignment was confirmed, which effectively confines electrons at the AlGaAs/GeSn interface. This study improves our understanding of the interfacial band structure in the grafted AlGaAs/GeSn heterostructure, providing experimental evidence of the Type-I band alignment between AlGaAs and GeSn and paving the way for their application in laser technologies., Comment: 18 pages, 4 figures
- Published
- 2024
33. Dynamic compensation for pump-induced frequency shift in Kerr-cat qubit initialization
- Author
-
Xu, Yifang, Hua, Ziyue, Wang, Weiting, Ma, Yuwei, Li, Ming, Chen, Jiajun, Zhou, Jie, Pan, Xiaoxuan, Xiao, Lintao, Huang, Hongwei, Cai, Weizhou, Ai, Hao, Liu, Yu-xi, Zou, Chang-Ling, and Sun, Luyan
- Subjects
Quantum Physics - Abstract
The noise-biased Kerr-cat qubit is an attractive candidate for fault-tolerant quantum computation; however, its initialization faces challenges due to the squeezing pump-induced frequency shift (PIFS). Here, we propose and demonstrate a dynamic compensation method to mitigate the effect of PIFS during the Kerr-cat qubit initialization. Utilizing a novel nonlinearity-engineered triple-loop SQUID device, we realize a stabilized Kerr-cat qubit and validate the advantages of the dynamic compensation method by improving the initialization fidelity from 57% to 78%, with a projected fidelity of 91% after excluding state preparation and measurement errors. Our results not only advance the practical implementation of Kerr-cat qubits, but also provide valuable insights into the fundamental adiabatic dynamics of these systems. This work paves the way for scalable quantum processors that leverage the bias-preserving properties of Kerr-cat qubits.
- Published
- 2024
34. Quantum state transfer between superconducting cavities via exchange-free interactions
- Author
-
Zhou, Jie, Li, Ming, Wang, Weiting, Cai, Weizhou, Hua, Ziyue, Xu, Yifang, Pan, Xiaoxuan, Xue, Guangming, Zhang, Hongyi, Song, Yipu, Yu, Haifeng, Zou, Chang-Ling, and Sun, Luyan
- Subjects
Quantum Physics - Abstract
We propose and experimentally demonstrate a novel protocol for transferring quantum states between superconducting cavities using only continuous two-mode squeezing interactions, without exchange of photonic excitations between cavities. This approach conceptually resembles quantum teleportation, where quantum information is transferred between different nodes without directly transmitting carrier photons. In contrast to the discrete operations of entanglement and Bell-state measurement in teleportation, our scheme is symmetric and continuous. We experimentally realize coherent and bidirectional transfer of arbitrary quantum states, including bosonic quantum error correction codes. Our results offer new insights into the quantum state transfer and quantum teleportation. In particular, our demonstration validates a new approach to realize quantum transducers, and might find applications in a wide range of physical platforms.
- Published
- 2024
35. MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient
- Author
-
Li, Yanzeng, Zeng, Cheng, Zhang, Jinchao, Zhou, Jie, and Zou, Lei
- Subjects
Computer Science - Artificial Intelligence - Abstract
Medical education relies heavily on Simulated Patients (SPs) to provide a safe environment for students to practice clinical skills, including medical image analysis. However, the high cost of recruiting qualified SPs and the lack of diverse medical imaging datasets have presented significant challenges. To address these issues, this paper introduces MedDiT, a novel knowledge-controlled conversational framework that can dynamically generate plausible medical images aligned with simulated patient symptoms, enabling diverse diagnostic skill training. Specifically, MedDiT integrates various patient Knowledge Graphs (KGs), which describe the attributes and symptoms of patients, to dynamically prompt Large Language Models' (LLMs) behavior and control the patient characteristics, mitigating hallucination during medical conversation. Additionally, a well-tuned Diffusion Transformer (DiT) model is incorporated to generate medical images according to the specified patient attributes in the KG. In this paper, we present the capabilities of MedDiT through a practical demonstration, showcasing its ability to act in diverse simulated patient cases and generate the corresponding medical images. This can provide an abundant and interactive learning experience for students, advancing medical education by offering an immersive simulation platform for future healthcare professionals. The work sheds light on the feasibility of incorporating advanced technologies like LLM, KG, and DiT in education applications, highlighting their potential to address the challenges faced in simulated patient-based medical education.
- Published
- 2024
36. EmbodiedSAM: Online Segment Any 3D Thing in Real Time
- Author
-
Xu, Xiuwei, Chen, Huangxing, Zhao, Linqing, Wang, Ziwei, Zhou, Jie, and Lu, Jiwen
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Robotics - Abstract
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly generalizable 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFMs to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage the Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transfer experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation., Comment: Project page: https://xuxw98.github.io/ESAM/
- Published
- 2024
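The cross-frame matching step described above, computing a similarity matrix between mask queries with a single matrix multiplication, can be sketched as follows. Names, shapes, the cosine metric, and the greedy matching rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def match_queries(q_prev, q_curr, threshold=0.5):
    """Match instance-mask queries across frames.
    q_prev: (M, D) and q_curr: (N, D) query embeddings for 3D masks.
    Returns a list of (prev_idx, curr_idx) matches."""
    # L2-normalize so the matrix product gives cosine similarities.
    a = q_prev / np.linalg.norm(q_prev, axis=1, keepdims=True)
    b = q_curr / np.linalg.norm(q_curr, axis=1, keepdims=True)
    sim = a @ b.T  # (M, N) similarity matrix in one matrix operation
    matches = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))  # greedy best match for each old mask
        if sim[i, j] >= threshold:
            matches.append((i, j))
    return matches
```

Because the per-frame cost reduces to one dense matrix product plus an argmax, this kind of matching is cheap enough for online, real-time operation, which is the point the abstract makes.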
37. Microsatellite-based real-time quantum key distribution
- Author
-
Li, Yang, Cai, Wen-Qi, Ren, Ji-Gang, Wang, Chao-Ze, Yang, Meng, Zhang, Liang, Wu, Hui-Ying, Chang, Liang, Wu, Jin-Cai, Jin, Biao, Xue, Hua-Jian, Li, Xue-Jiao, Liu, Hui, Yu, Guang-Wen, Tao, Xue-Ying, Chen, Ting, Liu, Chong-Fei, Luo, Wen-Bin, Zhou, Jie, Yong, Hai-Lin, Li, Yu-Huai, Li, Feng-Zhi, Jiang, Cong, Chen, Hao-Ze, Wu, Chao, Tong, Xin-Hai, Xie, Si-Jiang, Zhou, Fei, Liu, Wei-Yue, Liu, Nai-Le, Li, Li, Xu, Feihu, Cao, Yuan, Yin, Juan, Shu, Rong, Wang, Xiang-Bin, Zhang, Qiang, Wang, Jian-Yu, Liao, Sheng-Kai, Peng, Cheng-Zhi, and Pan, Jian-Wei
- Subjects
Quantum Physics - Abstract
A quantum network provides an infrastructure connecting quantum devices with revolutionary computing, sensing, and communication capabilities. As the best-known application of a quantum network, quantum key distribution (QKD) shares secure keys guaranteed by the laws of quantum mechanics. A quantum satellite constellation offers a solution to facilitate the quantum network on a global scale. The Micius satellite has verified the feasibility of satellite quantum communications, however, scaling up quantum satellite constellations is challenging, requiring small lightweight satellites, portable ground stations and real-time secure key exchange. Here we tackle these challenges and report the development of a quantum microsatellite capable of performing space-to-ground QKD using portable ground stations. The quantum microsatellite features a payload weighing approximately 23 kg, while the portable ground station weighs about 100 kg. These weights represent reductions by more than an order and two orders of magnitude, respectively, compared to the Micius satellite. Additionally, we multiplex bidirectional satellite-ground optical communication with quantum communication, enabling key distillation and secure communication in real-time. Using the microsatellite and the portable ground stations, we demonstrate satellite-based QKD with multiple ground stations and achieve the sharing of up to 0.59 million bits of secure keys during a single satellite pass. The compact quantum payload can be readily assembled on existing space stations or small satellites, paving the way for a satellite-constellation-based quantum and classical network for widespread real-life applications., Comment: 40 pages, 8 figures
- Published
- 2024
38. AlGaAs/GeSn p-i-n diode interfaced with ultrathin Al$_2$O$_3$
- Author
-
Liu, Yang, Li, Yiran, Acharya, Sudip, Zhou, Jie, Gong, Jiarui, Abrand, Alireza, Lu, Yi, Vincent, Daniel, Haessly, Samuel, Mohseni, Parsian K., Yu, Shui-Qing, and Ma, Zhenqiang
- Subjects
Physics - Applied Physics ,Condensed Matter - Materials Science - Abstract
This study presents the fabrication and characterizations of an Al$_{0.3}$Ga$_{0.7}$As/Ge$_{0.87}$Sn$_{0.13}$/GeSn p-i-n double heterostructure (DHS) diode following the grafting approach for enhanced optoelectronic applications. By integrating ultra-thin Al$_2$O$_3$ as a quantum tunneling layer and enhancing interfacial double-side passivation, we achieved a heterostructure with a substantial 1.186 eV conduction band barrier between AlGaAs and GeSn, along with a low interfacial density of states. The diode demonstrated impressive electrical characteristics with high uniformity, including a mean ideality factor of 1.47 and a mean rectification ratio of 2.95$\times$10$^3$ at $\pm$2 V across 326 devices, indicating high-quality device fabrication. Comprehensive electrical characterizations, including C-V and I-V profiling, affirm the diode's capability to provide robust electrical confinement and efficient carrier injection. These properties make the Al$_{0.3}$Ga$_{0.7}$As/Ge$_{0.87}$Sn$_{0.13}$/GeSn DHS a promising candidate for next-generation electrically pumped GeSn lasers, potentially operable at higher temperatures. Our results provide a viable pathway for further advancements in various GeSn-based devices., Comment: 5 pages, 4 figures
- Published
- 2024
39. Large Language Models for Base Station Siting: Intelligent Deployment based on Prompt or Agent
- Author
-
Wang, Yanhu, Afzal, Muhammad Muzammil, Li, Zhengyang, Zhou, Jie, Feng, Chenyuan, Guo, Shuaishuai, and Quek, Tony Q. S.
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Traditional base station siting (BSS) methods rely heavily on drive testing and user feedback, which are laborious and require extensive expertise in communication, networking, and optimization. As large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering, network optimization will witness a revolutionary approach. This approach entails the strategic use of well-crafted prompts to infuse human experience and knowledge into these sophisticated LLMs, and the deployment of autonomous agents as a communication bridge to seamlessly connect the machine-language-based LLMs with human users via natural language. This integration represents the future paradigm of artificial intelligence (AI) as a service and easier-to-use AI. As a preliminary exploration, this research first develops a novel LLM-empowered BSS optimization framework and heuristically proposes four potential implementations: strategies based on a prompt-optimized LLM (PoL), a human-in-the-loop LLM (HiLL), an LLM-empowered autonomous BSS agent (LaBa), and cooperative multiple LLM-based autonomous BSS agents (CLaBa). Through evaluation on real-world data, the experiments demonstrate that prompt-assisted LLMs and LLM-based agents can generate more efficient, cost-effective, and reliable network deployments, noticeably enhancing the efficiency of BSS optimization and reducing trivial manual participation.
- Published
- 2024
40. Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction
- Author
-
Li, Wenhao, Zhou, Jie, Luo, Chuan, Tang, Chao, Zhang, Kun, and Zhao, Shixiong
- Subjects
Computer Science - Information Retrieval ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition ,68T09 ,I.2.0 - Abstract
In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, makes effective recommendation in online, dynamic scenes complex. In this work, we propose the Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancement of the diverged logical information between scenes. We demonstrate SwAN's potential to optimize dynamic multi-scene recommendation problems by effectively handling cold-start recommendations online for newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan's online catering recommendation service, which serves millions of customers per day, where it has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion., Comment: 10 pages, 6 figures, accepted by Recsys 2024
- Published
- 2024
41. MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- Author
-
Yao, Yuan, Yu, Tianyu, Zhang, Ao, Wang, Chongyi, Cui, Junbo, Zhu, Hongji, Cai, Tianchi, Li, Haoyu, Zhao, Weilin, He, Zhihui, Chen, Qianyu, Zhou, Huarong, Zou, Zhensheng, Zhang, Haoye, Hu, Shengding, Zheng, Zhi, Zhou, Jie, Cai, Jie, Han, Xu, Zeng, Guoyang, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future., Comment: preprint
- Published
- 2024
42. UniTTA: Unified Benchmark and Versatile Framework Towards Realistic Test-Time Adaptation
- Author
-
Du, Chaoqun, Wang, Yulin, Guo, Jiayi, Han, Yizeng, Zhou, Jie, and Huang, Gao
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
Test-Time Adaptation (TTA) aims to adapt pre-trained models to the target domain during testing. In reality, this adaptability can be influenced by multiple factors. Researchers have identified various challenging scenarios and developed diverse methods to address these challenges, such as dealing with continual domain shifts, mixed domains, and temporally correlated or imbalanced class distributions. Despite these efforts, a unified and comprehensive benchmark has yet to be established. To this end, we propose a Unified Test-Time Adaptation (UniTTA) benchmark, which is comprehensive and widely applicable. Each scenario within the benchmark is fully described by a Markov state transition matrix for sampling from the original dataset. The UniTTA benchmark considers both domain and class as two independent dimensions of data and addresses various combinations of imbalance/balance and i.i.d./non-i.i.d./continual conditions, covering a total of \( (2 \times 3)^2 = 36 \) scenarios. It establishes a comprehensive evaluation benchmark for realistic TTA and provides a guideline for practitioners to select the most suitable TTA method. Alongside this benchmark, we propose a versatile UniTTA framework, which includes a Balanced Domain Normalization (BDN) layer and a COrrelated Feature Adaptation (COFA) method, designed to mitigate distribution gaps in domain and class, respectively. Extensive experiments demonstrate that our UniTTA framework excels within the UniTTA benchmark and achieves state-of-the-art performance on average. Our code is available at \url{https://github.com/LeapLabTHU/UniTTA}.
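The core sampling mechanism, a Markov state transition matrix driving the test stream, can be sketched in a few lines of Python (the matrix values below are illustrative, not taken from the paper):

```python
import random

# Illustrative 2-state Markov chain over domains: row i gives the probability
# of moving from domain i to each domain at the next sample.  A near-identity
# matrix yields temporally correlated ("continual") streams; uniform rows
# would yield i.i.d. sampling.
TRANSITION = [
    [0.9, 0.1],  # from domain 0
    [0.1, 0.9],  # from domain 1
]

def sample_stream(transition, start=0, length=20, seed=0):
    """Sample a sequence of domain indices from the transition matrix."""
    rng = random.Random(seed)
    state, stream = start, []
    for _ in range(length):
        stream.append(state)
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
    return stream

stream = sample_stream(TRANSITION)
```

High self-transition probabilities produce long runs of the same domain, modeling non-i.i.d. test streams; the benchmark applies the same idea jointly to the domain and class dimensions.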
- Published
- 2024
43. Si/AlN p-n heterojunction interfaced with ultrathin SiO2
- Author
-
Abbasi, Haris Naeem, Zhou, Jie, Wang, Ding, Sun, Kai, Wang, Ping, Lu, Yi, Gong, Jiarui, Liu, Dong, Liu, Yang, Singh, Ranveer, Mi, Zetian, and Ma, Zhenqiang
- Subjects
Physics - Applied Physics ,Condensed Matter - Materials Science - Abstract
Ultra-wide bandgap (UWBG) materials hold immense potential for high-power RF electronics and deep-ultraviolet photonics. Among these, AlGaN emerges as a promising candidate, offering a tunable bandgap from 3.4 eV (GaN) to 6.1 eV (AlN) and remarkable material characteristics. However, achieving efficient p-type doping in high-aluminum-composition AlGaN remains a formidable challenge. This study presents an alternative approach to address this issue by fabricating a p+ Si/n-AlN/n+ AlGaN heterojunction structure using the semiconductor grafting technique. Atomic force microscopy (AFM) analysis revealed that the AlN and nanomembrane surfaces exhibited smooth topography, with roughnesses of 1.96 nm and 0.545 nm, respectively. High-angle annular dark field scanning transmission electron microscopy (HAADF-STEM) confirmed a sharp and well-defined Si/AlN interface, with minimal defects and strong chemical bonding, crucial for efficient carrier transport. X-ray photoelectron spectroscopy (XPS) measurements demonstrated a type-I heterojunction with a valence band offset of 2.73-2.84 eV and a conduction band offset of 2.22-2.11 eV. The p-n diode devices exhibited a linear current-voltage (I-V) characteristic, an ideality factor of 1.92, and a rectification ratio of 3.3×10^4, with a turn-on voltage indicating an effective p-n heterojunction. Temperature-dependent I-V measurements showed stable operation up to 90 °C. The heterojunction's high-quality interface and electrical performance showcase its potential for advanced AlGaN-based optoelectronic and electronic devices., Comment: 23 pages, 6 figures
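As a quick sanity check on the reported offsets: for a type-I alignment, the valence and conduction band offsets plus the narrow-semiconductor gap should sum to the wide gap. The Si gap of about 1.12 eV is an assumed textbook value; only the 6.1 eV AlN gap appears in the abstract.

```python
# Consistency check for the reported band offsets: in a heterojunction,
#   Eg(AlN) = Eg(Si) + dEv + dEc
# Eg(Si) ~= 1.12 eV is a textbook value (assumption); Eg(AlN) = 6.1 eV is
# quoted in the abstract.
EG_SI, EG_ALN = 1.12, 6.1

for dEv, dEc in [(2.73, 2.22), (2.84, 2.11)]:
    total = EG_SI + dEv + dEc
    # Both reported offset pairs close the gap to within ~0.05 eV.
    assert abs(total - EG_ALN) < 0.1, (dEv, dEc, total)
```

Both offset pairs give 1.12 + dEv + dEc = 6.07 eV, consistent with the 6.1 eV AlN gap, which supports the type-I assignment.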
- Published
- 2024
44. Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching
- Author
-
Ding, Yuyang, Hu, Hanglei, Zhou, Jie, Chen, Qin, Jiang, Bo, and He, Liang
- Subjects
Computer Science - Computation and Language - Abstract
With the introduction of large language models (LLMs), automatic math reasoning has seen tremendous success. However, current methods primarily focus on providing solutions or on using techniques like Chain-of-Thought to enhance problem-solving accuracy. In this paper, we focus on improving the capability of mathematics teaching via a Socratic teaching-based LLM (\texttt{SocraticLLM}), which guides learners toward deep thinking and self-discovery through conversation. We collect and release a high-quality mathematical teaching dataset, named \texttt{SocraticMATH}, which provides Socratic-style conversations about problems with extra knowledge. We also propose a knowledge-enhanced LLM as a strong baseline to generate reliable responses with review, guidance/heuristic, rectification, and summarization. Experimental results show the clear advantages of \texttt{SocraticLLM} over several strong generative models. The codes and datasets are available at \url{https://github.com/ECNU-ICALK/SocraticMath}., Comment: Accepted By CIKM 2024
- Published
- 2024
45. Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words
- Author
-
Chen, Yijie, Liu, Yijin, Meng, Fandong, Xu, Jinan, Chen, Yufeng, and Zhou, Jie
- Subjects
Computer Science - Computation and Language - Abstract
Gender bias has been a focal point in the study of bias in machine translation and language models. Existing machine translation gender bias evaluations primarily focus on male and female genders, limiting the scope of the evaluation. To assess gender bias, these studies often rely on calculating the accuracy of gender pronouns or the masculine and feminine attributes of grammatical gender via stereotypes triggered by occupations or sentiment words ({\em i.e.}, words with a clear positive or negative attitude), which cannot extend to non-binary groups. This study presents AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words), a benchmark that assesses gender bias beyond binary gender. Meanwhile, we propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which quantifies ambiguous attitude words. Evaluating three recent and effective open-source LLMs and one powerful multilingual translation-specific model, our main observations are: (1) Translation performance in non-binary gender contexts is markedly inferior in quality and exhibits more negative attitudes than in binary-gender contexts. (2) Analysis experiments indicate that incorporating constraint context in prompts for gender identity terms can substantially reduce translation bias, though the bias remains evident despite the constraints. The code is publicly available at \url{https://github.com/pppa2019/ambGIMT}., Comment: The code is publicly available at \url{https://github.com/pppa2019/ambGIMT}
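A toy illustration of scoring attitude words: the lexicon values and the simple averaging below are hypothetical stand-ins, not the paper's actual EAS formulation.

```python
# Toy sketch of an Emotional Attitude Score (EAS): average a per-word
# sentiment value over the attitude words found in a translation.
# The lexicon entries and the plain mean are illustrative assumptions.
ATTITUDE_LEXICON = {
    "sensitive": 0.1,   # ambiguous: can read as positive or negative
    "assertive": 0.2,   # ambiguous
    "stubborn": -0.6,   # clearly negative
    "brilliant": 0.9,   # clearly positive
}

def eas(tokens, lexicon=ATTITUDE_LEXICON):
    """Mean attitude score of the lexicon words present in `tokens`."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

score = eas("she is sensitive and assertive".split())  # about 0.15
```

Comparing such scores between binary and non-binary contexts is the kind of contrast the benchmark formalizes; the real metric would need a curated lexicon rather than these made-up values.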
- Published
- 2024
46. Large Language Model for Verilog Generation with Golden Code Feedback
- Author
-
Wang, Ning, Yao, Bingkun, Zhou, Jie, Wang, Xi, Jiang, Zhe, and Guan, Nan
- Subjects
Computer Science - Hardware Architecture ,Computer Science - Artificial Intelligence - Abstract
Recent advancements in large language models (LLMs) have catalyzed significant interest in the automatic generation of Register-Transfer Level (RTL) code, particularly Verilog, from natural language instructions. While commercial LLMs like ChatGPT have dominated this domain, open-source alternatives have lagged considerably in performance, limiting the flexibility and data privacy of this emerging technology. This study introduces a novel approach utilizing reinforcement learning with golden code feedback to enhance the performance of pre-trained models. Leveraging open-source data and base models, we have achieved state-of-the-art (SOTA) results by a substantial margin. Notably, our 6.7B-parameter model VeriSeek demonstrates superior performance compared to the current best-in-class 13B and 16B models. Furthermore, through a comprehensive analysis of the limitations of direct fine-tuning and the training dynamics of reinforcement learning, we posit that the development of comprehensive supervisory signals, which align with the inherent parallel semantics of Verilog code, is critical to effective generation. The code and data associated with this research are publicly available at \url{https://github.com/CatIIIIIIII/veriseek}. The model weights can be accessed at \url{https://huggingface.co/WANGNingroci/VeriSeek}.
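The abstract does not specify the reward function; one minimal stand-in for "golden code feedback" is a similarity score between the generated Verilog and the golden reference, so partially correct code still earns partial reward:

```python
import difflib

def golden_reward(generated: str, golden: str) -> float:
    """Similarity-based reward in [0, 1] between generated and golden Verilog.

    A hypothetical stand-in for the paper's feedback signal: token-level
    SequenceMatcher ratio over whitespace-split tokens.
    """
    return difflib.SequenceMatcher(
        None, generated.split(), golden.split()
    ).ratio()

golden = "module and2(input a, input b, output y); assign y = a & b; endmodule"
good   = "module and2(input a, input b, output y); assign y = a & b; endmodule"
bad    = "module or2(input a, input b, output y); assign y = a | b; endmodule"

assert golden_reward(good, golden) == 1.0
assert golden_reward(bad, golden) < 1.0
```

A dense reward like this contrasts with a binary pass/fail signal from simulation; the paper's argument about supervisory signals aligned with Verilog's parallel semantics suggests its actual reward is more structure-aware than this token-level sketch.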
- Published
- 2024
47. Patch-Level Training for Large Language Models
- Author
-
Shao, Chenze, Meng, Fandong, and Zhou, Jie
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5$\times$, without compromising the model performance compared to token-level training. Source code: \url{https://github.com/shaochenze/PatchTrain}.
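The compute saving follows directly from the sequence shortening: grouping K tokens into one patch divides the processed length by K for the patch-level portion of the data. The fixed-size grouping and the particular setting below (fraction f = 2/3 of data at patch size k = 4) are illustrative choices that happen to reproduce the 0.5x figure from the abstract.

```python
# Sketch of patch-level sequence compression: group k consecutive token ids
# into one "patch", shortening the sequence the model must process by k.
def to_patches(tokens, k):
    """Split a token-id sequence into patches of k tokens (last may be short)."""
    return [tuple(tokens[i:i + k]) for i in range(0, len(tokens), k)]

tokens = list(range(12))
patches = to_patches(tokens, 4)
assert len(patches) == 3                 # 12 tokens -> 3 patches
assert patches[0] == (0, 1, 2, 3)

# If a fraction f of the data is trained at patch level with patch size k,
# relative compute is roughly f/k + (1 - f).  With f=2/3, k=4 this gives
# ~0.5x, matching the overall cost reported in the abstract.
f, k = 2 / 3, 4
rel_cost = f / k + (1 - f)
assert abs(rel_cost - 0.5) < 1e-9
```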
- Published
- 2024
48. Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task
- Author
-
Yang, Yiran, Zhang, Jinchao, Deng, Ying, and Zhou, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Inspired by the success of the text-to-image (T2I) generation task, many researchers are devoting themselves to the text-to-video (T2V) generation task. Most T2V frameworks inherit from a T2I model and add extra temporal layers during training to generate dynamic videos, which can be viewed as a fine-tuning task. However, the traditional 3D-UNet is serial: the temporal layers follow the spatial layers, and this serial feature flow results in high GPU memory and training time consumption. With large diffusion models and massive datasets, this serial mode further inflates training costs, which is neither environmentally friendly nor conducive to the development of T2V. Therefore, we propose a highly efficient spatial-temporal parallel training paradigm for T2V tasks, named Mobius. In our 3D-UNet, the temporal layers and spatial layers are parallel, which optimizes the feature flow and backpropagation. Mobius saves 24% GPU memory and 12% training time, which can greatly improve T2V fine-tuning and provides a novel insight for the AIGC community. We will release our code in the future.
- Published
- 2024
49. Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation
- Author
-
Lan, Zhibin, Niu, Liqiang, Meng, Fandong, Zhou, Jie, Zhang, Min, and Su, Jinsong
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language. Conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and in retaining the visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as the model is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulty of directly predicting excessively long pixel sequences. In this paper, we propose \textit{Translatotron-V(ision)}, an end-to-end IIMT model consisting of four modules. In addition to an image encoder and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. In addition, we present a two-stage training framework to help the model learn alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9\% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model., Comment: Accepted to ACL 2024 Findings
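A back-of-envelope view of why the image tokenizer matters: a raw pixel sequence scales as H×W, while visual tokens scale as (H/p)×(W/p) for patch size p. The image and patch sizes below are illustrative, not the paper's.

```python
# Sequence-length budget for an image decoder: raw pixels vs. visual tokens.
# Image size (256x256) and patch size (16) are assumed for illustration.
def seq_lengths(h, w, p):
    """Return (pixel sequence length, visual token sequence length)."""
    assert h % p == 0 and w % p == 0
    return h * w, (h // p) * (w // p)

pixels, tokens = seq_lengths(256, 256, 16)
assert pixels == 65536
assert tokens == 256   # 256x shorter for the decoder to predict
```

Predicting 256 tokens instead of 65,536 pixels is what lets the decoder concentrate on content rather than low-level visual detail.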
- Published
- 2024
50. Camera-LiDAR Cross-modality Gait Recognition
- Author
-
Guo, Wenxuan, Liang, Yingping, Pan, Zhiyu, Xi, Ziheng, Feng, Jianjiang, and Zhou, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Gait recognition is a crucial biometric identification technique. Camera-based gait recognition has been widely applied in both research and industry. LiDAR-based gait recognition has also begun to evolve recently, owing to the 3D structural information it provides. However, cameras fail to recognize persons in certain applications, such as low-light environments and long-distance recognition scenarios, where LiDARs work well. On the other hand, the deployment cost and complexity of LiDAR systems limit their wider application. It is therefore essential to consider cross-modality gait recognition between cameras and LiDARs for a broader range of applications. In this work, we propose the first cross-modality gait recognition framework between camera and LiDAR, namely CL-Gait. It employs a two-stream network for feature embedding of both modalities. This poses a challenging recognition task, as matching inherently different 3D and 2D data exhibits significant modality discrepancy. To align the feature spaces of the two modalities, i.e., camera silhouettes and LiDAR points, we propose a contrastive pre-training strategy to mitigate the modality discrepancy. To make up for the absence of paired camera-LiDAR data for pre-training, we also introduce a strategy for generating data at a large scale. This strategy utilizes monocular depth estimated from single RGB images and virtual cameras to generate pseudo point clouds for contrastive pre-training. Extensive experiments show that cross-modality gait recognition is very challenging but feasible and promising with our proposed model and pre-training strategy. To the best of our knowledge, this is the first work to address cross-modality gait recognition., Comment: Accepted at ECCV 2024
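The data-generation strategy hinges on unprojecting monocular depth through a virtual pinhole camera. A minimal sketch of that step, where the intrinsics (fx, fy, cx, cy) and the tiny depth map are illustrative:

```python
# Minimal pinhole unprojection: turn a depth map into a pseudo point cloud,
# the core step in the data-generation strategy described above.
def unproject(depth, fx, fy, cx, cy):
    """depth: rows of depth values (meters) -> list of (X, Y, Z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:          # skip invalid depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

depth = [[2.0, 2.0],
         [0.0, 4.0]]            # one invalid pixel
pts = unproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
assert len(pts) == 3
assert pts[0] == (-1.0, -1.0, 2.0)   # pixel (u=0, v=0) at depth 2
```

Rendering such pseudo clouds from many virtual viewpoints yields paired 2D/3D data at scale without a physical LiDAR, which is what makes the contrastive pre-training feasible.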
- Published
- 2024