Author: "Xiao, Chenjun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xiao, Chenjun"' showing total 33 results

1. Kimi k1.5: Scaling Reinforcement Learning with LLMs

Author: Kimi Team, Du, Angang, Gao, Bofei, Xing, Bowei, Jiang, Changjiu, Chen, Cheng, Li, Cheng, Xiao, Chenjun, Du, Chenzhuang, Liao, Chonghua, Tang, Chuning, Wang, Congcong, Zhang, Dehao, Yuan, Enming, Lu, Enzhe, Tang, Fengxiang, Sung, Flood, Wei, Guangda, Lai, Guokun, Guo, Haiqing, Zhu, Han, Ding, Hao, Hu, Hao, Yang, Hao, Zhang, Hao, Yao, Haotian, Zhao, Haotian, Lu, Haoyu, Li, Haoze, Yu, Haozhen, Gao, Hongcheng, Zheng, Huabin, Yuan, Huan, Chen, Jia, Guo, Jianhang, Su, Jianlin, Wang, Jianzhou, Zhao, Jie, Zhang, Jin, Liu, Jingyuan, Yan, Junjie, Wu, Junyan, Shi, Lidong, Ye, Ling, Yu, Longhui, Dong, Mengnan, Zhang, Neo, Ma, Ningchen, Pan, Qiwei, Gong, Qucheng, Liu, Shaowei, Ma, Shengling, Wei, Shupeng, Cao, Sihan, Huang, Siying, Jiang, Tao, Gao, Weihao, Xiong, Weimin, He, Weiran, Huang, Weixiao, Wu, Wenhao, He, Wenyang, Wei, Xianghui, Jia, Xianqing, Wu, Xingzhe, Xu, Xinran, Zu, Xinxing, Zhou, Xinyu, Pan, Xuehai, Charles, Y., Li, Yang, Hu, Yangyang, Liu, Yangyang, Chen, Yanru, Wang, Yejie, Liu, Yibo, Qin, Yidao, Liu, Yifeng, Yang, Ying, Bao, Yiping, Du, Yulun, Wu, Yuxin, Wang, Yuzhi, Zhou, Zaida, Wang, Zhaoji, Li, Zhaowei, Zhu, Zhen, Zhang, Zheng, Wang, Zhexu, Yang, Zhilin, Huang, Zhiqi, Huang, Zihao, Xu, Ziyao, and Yang, Zonghan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%). Comment: 25 pages
Published: 2025

2. $\beta$-DQN: Improving Deep Q-Learning By Evolving the Behavior

Author: Zhang, Hongming, Bai, Fengshuo, Xiao, Chenjun, Gao, Chao, Xu, Bo, and Müller, Martin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
Published: 2025

3. Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

Author: Gao, Chen-Xiao, Fang, Shengjun, Xiao, Chenjun, Yu, Yang, and Zhang, Zongzhang
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at https://github.com/typoverflow/WiseRL.
Published: 2024

4. Diffusion Spectral Representation for Reinforcement Learning

Author: Shribak, Dmitry, Gao, Chen-Xiao, Li, Yitong, Xiao, Chenjun, and Dai, Bo
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i.e., sampling from a diffusion model is considerably slow as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithm framework that enables extracting sufficient representations for value functions in Markov decision processes (MDP) and partially observable Markov decision processes (POMDP). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings. Comment: NeurIPS 2024
Published: 2024

5. Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Author: Che, Fengdi, Xiao, Chenjun, Mei, Jincheng, Dai, Bo, Gummadi, Ramki, Ramirez, Oscar A, Harris, Christopher K, Mahmood, A. Rupam, and Schuurmans, Dale
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
Published: 2024

6. An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

Author: Pan, Yangchen, Wen, Junfeng, Xiao, Chenjun, and Torr, Philip
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when noises are correlated, the TD's solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.
Published: 2024

7. Provable Representation with Efficient Planning for Partial Observable Reinforcement Learning

Author: Zhang, Hongming, Ren, Tongzheng, Xiao, Chenjun, Schuurmans, Dale, and Dai, Bo
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows for partial observability to be accounted for in learning, exploration and planning, but presents significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications. Comment: The first two authors contribute equally
Published: 2023

8. Rethinking Decision Transformer via Hierarchical Reinforcement Learning

Author: Ma, Yi, Xiao, Chenjun, Liang, Hebin, and Hao, Jianye
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Decision Transformer (DT) is an innovative algorithm leveraging recent advances of the transformer architecture in reinforcement learning (RL). However, a notable limitation of DT is its reliance on recalling trajectories from datasets, losing the capability to seamlessly stitch sub-optimal trajectories together. In this work we introduce a general sequence modeling framework for studying sequential decision making through the lens of Hierarchical RL. At the time of making decisions, a high-level policy first proposes an ideal prompt for the current state, a low-level policy subsequently generates an action conditioned on the given prompt. We show DT emerges as a special case of this framework with certain choices of high-level and low-level policies, and discuss the potential failure of these choices. Inspired by these observations, we study how to jointly optimize the high-level and low-level policies to enable the stitching ability, which further leads to the development of new offline RL algorithms. Our empirical results clearly show that the proposed algorithms significantly surpass DT on several control and navigation benchmarks. We hope our contributions can inspire the integration of transformer architectures within the field of RL.
Published: 2023

9. HarmonyDream: Task Harmonization Inside World Models

Author: Ma, Haoyu, Wu, Jialong, Feng, Ningya, Xiao, Chenjun, Li, Dong, Hao, Jianye, Wang, Jianmin, and Long, Mingsheng
Subjects: Computer Science - Machine Learning
Abstract: Model-based reinforcement learning (MBRL) holds the promise of sample-efficient learning by utilizing a world model, which models how the environment works and typically encompasses components for two tasks: observation modeling and reward modeling. In this paper, through a dedicated empirical investigation, we gain a deeper understanding of the role each task plays in world models and uncover the overlooked potential of sample-efficient MBRL by mitigating the domination of either observation or reward modeling. Our key insight is that while prevalent approaches of explicit MBRL attempt to restore abundant details of the environment via observation models, it is difficult due to the environment's complexity and limited model capacity. On the other hand, reward models, while dominating implicit MBRL and adept at learning compact task-centric dynamics, are inadequate for sample-efficient learning without richer learning signals. Motivated by these insights and discoveries, we propose a simple yet effective approach, HarmonyDream, which automatically adjusts loss coefficients to maintain task harmonization, i.e. a dynamic equilibrium between the two tasks in world model learning. Our experiments show that the base MBRL method equipped with HarmonyDream gains 10%-69% absolute performance boosts on visual robotic tasks and sets a new state-of-the-art result on the Atari 100K benchmark. Code is available at https://github.com/thuml/HarmonyDream. Comment: ICML 2024
Published: 2023

10. Iteratively Refined Behavior Regularization for Offline Reinforcement Learning

Author: Hu, Xiaohan, Ma, Yi, Xiao, Chenjun, Zheng, Yan, and Hao, Jianye
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: One of the fundamental challenges for offline reinforcement learning (RL) is ensuring robustness to data distribution. Whether the data originates from a near-optimal policy or not, we anticipate that an algorithm should demonstrate its ability to learn an effective control policy that seamlessly aligns with the inherent distribution of offline data. Unfortunately, behavior regularization, a simple yet effective offline RL algorithm, tends to struggle in this regard. In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration. Our key observation is that by iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement, while also implicitly avoiding querying out-of-sample actions to prevent catastrophic learning failures. We prove that in the tabular setting this algorithm is capable of learning the optimal policy covered by the offline dataset, commonly referred to as the in-sample optimal policy. We then explore several implementation details of the algorithm when function approximation is applied. The resulting algorithm is easy to implement, requiring only a few lines of code modification to existing methods. Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks, clearly demonstrating its superiority over behavior regularization.
Published: 2023

11. Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning

Author: Zhao, Xutong, Pan, Yangchen, Xiao, Chenjun, Chandar, Sarath, and Rajendran, Janarthanan
Subjects: Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of a sequential action-computation scheme. The high-level intuition is that to perform optimism-based exploration, agents would explore cooperative strategies if each agent's optimism estimate captures a structured dependency relationship with other agents. Assuming agents compute actions following a sequential order at each environment timestep, we provide a perspective to view MARL as tree search iterations by considering agents as nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent's state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and joint actions of preceding agents. COE is performed during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks. Comment: Accepted at UAI 2023
Published: 2023

12. The In-Sample Softmax for Offline Reinforcement Learning

Author: Xiao, Chenjun, Wang, Han, Pan, Yangchen, White, Adam, and White, Martha
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action-coverage: the standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. There are a growing number of methods that attempt to approximate an in-sample max that only uses actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better or comparable to existing offline RL methods, and is also well-suited to fine-tuning.
Published: 2023

13. Latent Variable Representation for Reinforcement Learning

Author: Ren, Tongzheng, Xiao, Chenjun, Zhang, Tianjun, Li, Na, Wang, Zhaoran, Sanghavi, Sujay, Schuurmans, Dale, and Dai, Bo
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks. Comment: ICLR 2023. The first two authors contribute equally. Project Website: https://rlrep.github.io/lvrep/
Published: 2022

14. Understanding the Effect of Stochasticity in Policy Optimization

Author: Mei, Jincheng, Dai, Bo, Xiao, Chenjun, Szepesvari, Csaba, and Schuurmans, Dale
Subjects: Computer Science - Machine Learning
Abstract: We study the effect of stochasticity in on-policy policy optimization, and make the following four contributions. First, we show that the preferability of optimization methods depends critically on whether stochastic versus exact gradients are used. In particular, unlike the true gradient setting, geometric information cannot be easily exploited in the stochastic case for accelerating policy optimization without detrimental consequences or impractical assumptions. Second, to explain these findings we introduce the concept of committal rate for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. Third, we show that in the absence of external oracle information, which allows an algorithm to determine the difference between optimal and sub-optimal actions given only on-policy samples, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely. That is, an uninformed algorithm either converges to a globally optimal policy with probability $1$ but at a rate no better than $O(1/t)$, or it achieves faster than $O(1/t)$ convergence but then must fail to converge to the globally optimal policy with some positive probability. Finally, we use the committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, then develop an ensemble method that can be guaranteed to achieve near-optimal solutions with high probability. Comment: 68 pages; Accepted at NeurIPS 2021
Published: 2021

15. The Curse of Passive Data Collection in Batch Reinforcement Learning

Author: Xiao, Chenjun, Lee, Ilbin, Dai, Bo, Schuurmans, Dale, and Szepesvari, Csaba
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In high-stakes applications, active experimentation may be considered too risky and thus data are often collected passively. While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of the current paper is the characterization of this price. For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $\mathrm{S}$ states and $\mathrm{A}$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\varepsilon$-optimal policy, where $H$ is the length of episodes. Note that this shows that the sample complexity blows up exponentially compared to the case of active data collection, a result which is not unexpected, but, as far as we know, has not been published before, and perhaps the form of the exact expression is a little surprising. We also extend these results in various directions, such as other criteria or learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard. Comment: 27 pages, 2 figures. AISTATS 2022. In this revision, we fix an error in the previous upper bound results
Published: 2021

16. On the Optimality of Batch Policy Optimization Algorithms

Author: Xiao, Chenjun, Wu, Yifan, Lattimore, Tor, Dai, Bo, Mei, Jincheng, Li, Lihong, Szepesvari, Csaba, and Schuurmans, Dale
Subjects: Computer Science - Machine Learning
Abstract: Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization. Comment: 29 pages, 8 figures
Published: 2021

17. On the Global Convergence Rates of Softmax Policy Gradient Methods

Author: Mei, Jincheng, Xiao, Chenjun, Szepesvari, Csaba, and Schuurmans, Dale
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax optimal policy $(c > 0)$. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies. Comment: 64 pages, 5 figures. Published in ICML 2020
Published: 2020

18. Learning to Combat Compounding-Error in Model-Based Reinforcement Learning

Author: Xiao, Chenjun, Wu, Yifan, Ma, Chen, Schuurmans, Dale, and Müller, Martin
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Despite its potential to improve sample complexity versus model-free approaches, model-based reinforcement learning can fail catastrophically if the model is inaccurate. An algorithm should ideally be able to trust an imperfect model over a reasonably long planning horizon, and only rely on model-free updates when the model errors get infeasibly large. In this paper, we investigate techniques for choosing the planning horizon on a state-dependent basis, where a state's planning horizon is determined by the maximum cumulative model error around that state. We demonstrate that these state-dependent model errors can be learned with Temporal Difference methods, based on a novel approach of temporally decomposing the cumulative model errors. Experimental results show that the proposed method can successfully adapt the planning horizon to account for state-dependent model accuracy, significantly improving the efficiency of policy learning compared to model-based and model-free baselines.
Published: 2019

19. Integrating Factorization Ranked Features in MCTS: An Experimental Study

Author: Xiao, Chenjun, Müller, Martin, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kara, Orhun, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Cazenave, Tristan, editor, Winands, Mark H.M., editor, Edelkamp, Stefan, editor, Schiffel, Stephan, editor, Thielscher, Michael, editor, and Togelius, Julian, editor
Published: 2017

20. Efficient Reinforcement Learning from Partial Observability

Author: Zhang, Hongming, Ren, Tongzheng, Xiao, Chenjun, Schuurmans, Dale, and Dai, Bo
Abstract: In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows for partial observability to be accounted for in learning, exploration and planning, but presents significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications. Comment: The first two authors contribute equally
Published: 2023

21. In-Sample Policy Iteration for Offline Reinforcement Learning

Author: Hu, Xiaohan, Ma, Yi, Xiao, Chenjun, Zheng, Yan, and Meng, Zhaopeng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: Offline reinforcement learning (RL) seeks to derive an effective control policy from previously collected data. To circumvent errors due to inadequate data coverage, behavior-regularized methods optimize the control policy while concurrently minimizing deviation from the data collection policy. Nevertheless, these methods often exhibit subpar practical performance, particularly when the offline dataset is collected by sub-optimal policies. In this paper, we propose a novel algorithm employing in-sample policy iteration that substantially enhances behavior-regularized methods in offline RL. The core insight is that by continuously refining the policy used for behavior regularization, in-sample policy iteration gradually improves itself while implicitly avoiding queries to out-of-sample actions, averting catastrophic learning failures. Our theoretical analysis verifies its ability to learn the in-sample optimal policy, exclusively utilizing actions well-covered by the dataset. Moreover, we propose competitive policy improvement, a technique applying two competitive policies, both of which are trained by iteratively improving over the best competitor. We show that this simple yet potent technique significantly enhances learning efficiency when function approximation is applied. Lastly, experimental results on the D4RL benchmark indicate that our algorithm outperforms previous state-of-the-art methods in most tasks.
Published: 2023
Full Text: View/download PDF

22. Advances in Simulation-Based Search and Batch Reinforcement Learning

Author: Xiao, Chenjun
Abstract: Reinforcement learning (RL) defines a general computational problem where the learner must learn to make good decisions through interactive experience. To be effective in solving this problem, the learner must be able to explore the environment, make accurate predictions about the future, and compute strategic plans. These joint challenges distinguish RL from other machine learning problems. This dissertation considers two sub-topics of RL: Planning and Batch RL. For planning, we contribute two novel techniques to improve the efficiency of Monte Carlo Tree Search (MCTS): 1) Memory-augmented MCTS incorporates a memory structure into MCTS in order to generate an approximate value estimate that combines the estimate of similar states; 2) a new MCTS algorithm that applies maximum entropy policy optimization to general sequential decision-making. For batch RL, we offer three analyses towards a better understanding of the theoretical foundations of batch RL: 1) a minimax and instance-dependent analysis of batch policy optimization algorithms; 2) a characterization of the curse of passive data collection in batch RL; and 3) a theoretical analysis of convergence and generalization properties of value prediction algorithms with overparameterized models.
Published: 2023
Full Text: View/download PDF

23. Design and implementation of Visual Epigynous Machine on Robotic Sorting

Author: Zeng, Lingchao, primary, Cao, Yuhua, additional, Xiao, Chenjun, additional, Zhong, Xiaohua, additional, Wei, Shangming, additional, and Zeng, Guanlin, additional
Published: 2022
Full Text: View/download PDF

24. On Principled Entropy Exploration in Policy Optimization

Author: Mei, Jincheng, primary, Xiao, Chenjun, additional, Huang, Ruitong, additional, Schuurmans, Dale, additional, and Müller, Martin, additional
Published: 2019
Full Text: View/download PDF

25. Memory-Augmented Monte Carlo Tree Search

Author: Xiao, Chenjun, primary, Mei, Jincheng, additional, and Müller, Martin, additional
Published: 2018
Full Text: View/download PDF

26. Factorization Ranking Model for Fast Move Prediction in the Game of Go

Author: Xiao, Chenjun
Subjects: 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology
Published: 2016
Full Text: View/download PDF

27. Only-One-Victor Pattern Learning in Computer Go

Author: Wang, Jiao, primary, Xiao, Chenjun, additional, Zhu, Tan, additional, Hsueh, Chu-Husan, additional, Tseng, Wen-Jie, additional, and Wu, I-Chen, additional
Published: 2017
Full Text: View/download PDF

28. Hash table in Chinese Chess

Author: Li Hongye, Xiao Chenjun, Lv Huizhan, and Wang Jiao
Subjects: Primary clustering, Theoretical computer science, Universal hashing, Computer science, Dynamic perfect hashing, Hash buster, Hash function, Zobrist hashing, 2-choice hashing, Linear hashing, Rolling hash, Hash table, Hopscotch hashing, Hash tree, Cuckoo hashing, Data_FILES, Hash chain, Cryptographic hash function, Consistent hashing, Perfect hash function, Double hashing, Tabulation hashing
Abstract: Hash tables are a very important technique in computer games. With the development of computer Chinese Chess, hash tables are widely used, and some implementations are innovative. Several notable hash table implementations are introduced, some of them new and others combining specific modifications with traditional designs, all taking the special characteristics of Chinese Chess into account. Two common methods, Zobrist hashing and the lockless algorithm in parallel search, are also presented. The experimental results show that these hash tables are essential, significantly improving overall performance.
Published: 2012
Full Text: View/download PDF

29. Factorization Ranking Model for Move Prediction in the Game of Go

Author: Xiao, Chenjun, primary and Müller, Martin, additional
Published: 2016
Full Text: View/download PDF

30. Applying determinized MCTS in Chinese Military Chess

Author: Xiao, Chenjun, primary, Zhu, Tan, additional, Lin, Chao, additional, Xu, Xinhe, additional, and Wang, Jiao, additional
Published: 2014
Full Text: View/download PDF

31. The 2nd National University Student Computer-Games Tournaments

Author: Wang, Jiao, primary, Li, Hongye, additional, Xiao, Chenjun, additional, and Xu, Xin-He, additional
Published: 2012
Full Text: View/download PDF

32. Hash table in Chinese Chess.

Author: Lv Huizhan, Xiao Chenjun, Li Hongye, and Wang Jiao
Abstract: Hash tables are a very important technique in computer games. With the development of computer Chinese Chess, hash tables are widely used, and some implementations are innovative. Several notable hash table implementations are introduced, some of them new and others combining specific modifications with traditional designs, all taking the special characteristics of Chinese Chess into account. Two common methods, Zobrist hashing and the lockless algorithm in parallel search, are also presented. The experimental results show that these hash tables are essential, significantly improving overall performance. [ABSTRACT FROM PUBLISHER]
Published: 2012
Full Text: View/download PDF

33. The 2nd National University Student Computer-Games Tournaments

Author: Wang, Jiao, Li, Hongye, Xiao, Chenjun, and Xu, Xin-He
Published: 2012
Full Text: View/download PDF
