Author: "Wang, Zhaoran" / Topic: fos: computer and information sciences - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wang, Zhaoran"' showing total 105 results

Start Over Author "Wang, Zhaoran" Topic fos: computer and information sciences

105 results on '"Wang, Zhaoran"'

1. Contextual Dynamic Pricing with Strategic Buyers

Author: Liu, Pangpang, Yang, Zhuoran, Wang, Zhaoran, and Sun, Will Wei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Machine Learning (stat.ML), Computer Science and Game Theory (cs.GT), Machine Learning (cs.LG)
Abstract: Personalized pricing, which involves tailoring prices based on individual characteristics, is commonly used by firms to implement a consumer-specific pricing policy. In this process, buyers can also strategically manipulate their feature data to obtain a lower price, incurring certain manipulation costs. Such strategic behavior can hinder firms from maximizing their profits. In this paper, we study the contextual dynamic pricing problem with strategic buyers. The seller does not observe the buyer's true feature, but a manipulated feature according to buyers' strategic behavior. In addition, the seller does not observe the buyers' valuation of the product, but only a binary response indicating whether a sale happens or not. Recognizing these challenges, we propose a strategic dynamic pricing policy that incorporates the buyers' strategic behavior into the online learning to maximize the seller's cumulative revenue. We first prove that existing non-strategic pricing policies that neglect the buyers' strategic behavior result in a linear $\Omega(T)$ regret with $T$ the total time horizon, indicating that these policies are not better than a random pricing policy. We then establish that our proposed policy achieves a sublinear regret upper bound of $O(\sqrt{T})$. Importantly, our policy is not a mere amalgamation of existing dynamic pricing policies and strategic behavior handling algorithms. Our policy can also accommodate the scenario when the marginal cost of manipulation is unknown in advance. To account for it, we simultaneously estimate the valuation parameter and the cost parameter in the online pricing policy, which is shown to also achieve an $O(\sqrt{T})$ regret bound. Extensive experiments support our theoretical developments and demonstrate the superior performance of our policy compared to other pricing policies that are unaware of the strategic behaviors.
Published: 2023

2. A General Framework for Sequential Decision-Making under Adaptivity Constraints

Author: Xiong, Nuoya, Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: We take the first step in studying general sequential decision-making under two adaptivity constraints: rare policy switch and batch learning. First, we provide a general class called the Eluder Condition class, which includes a wide range of reinforcement learning classes. Then, for the rare policy switch constraint, we provide a generic algorithm to achieve a $\widetilde{\mathcal{O}}(\log K) $ switching cost with a $\widetilde{\mathcal{O}}(\sqrt{K})$ regret on the EC class. For the batch learning constraint, we provide an algorithm that provides a $\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret with the number of batches $B.$ This paper is the first work considering rare policy switch and batch learning under general function classes, which covers nearly all the models studied in the previous works such as tabular MDP (Bai et al. 2019; Zhang et al. 2020), linear MDP (Wang et al. 2021; Gao et al. 2021), low eluder dimension MDP (Kong et al. 2021; Gao et al. 2021), generalized linear function approximation (Qiao et al. 2023), and also some new classes such as the low $D_\Delta$-type Bellman eluder dimension problem, linear mixture MDP, kernelized nonlinear regulator and undercomplete partially observed Markov decision process (POMDP)., Comment: 48 pages
Published: 2023

3. What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Author: Zhang, Yufeng, Zhang, Fengzhuo, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned within language models? (b) What are suitable performance metrics to evaluate ICL accurately and what are the error rates? (c) How does the transformer architecture enable ICL? To answer (a), we take a Bayesian view and demonstrate that ICL implicitly implements the Bayesian model averaging algorithm. This Bayesian model averaging algorithm is proven to be approximately parameterized by the attention mechanism. For (b), we analyze the ICL performance from an online learning perspective and establish a regret bound $\mathcal{O}(1/T)$, where $T$ is the ICL input sequence length. To address (c), in addition to the encoded Bayesian model averaging algorithm in attention, we show that during pertaining, the total variation distance between the learned model and the nominal model is bounded by a sum of an approximation error and a generalization error of $\tilde{\mathcal{O}}(1/\sqrt{N_{\mathrm{p}}T_{\mathrm{p}}})$, where $N_{\mathrm{p}}$ and $T_{\mathrm{p}}$ are the number of token sequences and the length of each sequence in pretraining, respectively. Our results provide a unified understanding of the transformer and its ICL ability with bounds on ICL regret, approximation, and generalization, which deepens our knowledge of these essential aspects of modern language models.
Published: 2023

4. One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

Author: Liu, Zhihan, Lu, Miao, Xiong, Wei, Zhong, Han, Hu, Hao, Zhang, Shenao, Zheng, Sirui, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level-sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs to optimize \emph{unconstrainedly} a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear regret with general function approximations for Markov decision processes (MDP) and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, we adapt deep RL baselines to design practical versions of \texttt{MEX}, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, \texttt{MEX} achieves similar sample efficiency while enjoying a lower computational cost and is more compatible with modern deep RL methods.
Published: 2023

5. Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

Author: Zhao, Yulai, Yang, Zhuoran, Wang, Zhaoran, and Lee, Jason D.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Computer Science - Multiagent Systems, Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT), Multiagent Systems (cs.MA)
Abstract: Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent direction for each local policy. Motivated by the observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO. We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate. We extend our algorithm to the off-policy setting and introduce pessimism to policy evaluation, which aligns with experiments. To our knowledge, this is the first provably convergent multi-agent PPO algorithm in cooperative Markov games., ICML 2023
Published: 2023

6. Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Author: Liu, Xiao-Yang, Xia, Ziyi, Yang, Hongyang, Gao, Jiechao, Zha, Daochen, Zhu, Ming, Wang, Christina Dan, Wang, Zhaoran, and Guo, Jian
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: The financial market is a particularly challenging playground for deep reinforcement learning due to its unique feature of dynamic datasets. Building high-quality market environments for training financial reinforcement learning (FinRL) agents is difficult due to major factors such as the low signal-to-noise ratio of financial data, survivorship bias of historical data, and model overfitting. In this paper, we present FinRL-Meta, a data-centric and openly accessible library that processes dynamic datasets from real-world markets into gym-style market environments and has been actively maintained by the AI4Finance community. First, following a DataOps paradigm, we provide hundreds of market environments through an automatic data curation pipeline. Second, we provide homegrown examples and reproduce popular research papers as stepping stones for users to design new trading strategies. We also deploy the library on cloud platforms so that users can visualize their own results and assess the relative performance via community-wise competitions. Third, we provide dozens of Jupyter/Python demos organized into a curriculum and a documentation website to serve the rapidly growing community. The open-source codes for the data curation pipeline are available at https://github.com/AI4Finance-Foundation/FinRL-Meta, 49 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2211.03107
Published: 2023

7. Finding Regularized Competitive Equilibria of Heterogeneous Agent Macroeconomic Models with Reinforcement Learning

Author: Xu, Ruitu, Min, Yifei, Wang, Tianhao, Wang, Zhaoran, Jordan, Michael I., and Yang, Zhuoran
Subjects: FOS: Economics and business, FOS: Computer and information sciences, Computer Science - Machine Learning, General Economics (econ.GN), Machine Learning (cs.LG), Economics - General Economics
Abstract: We study a heterogeneous agent macroeconomic model with an infinite number of households and firms competing in a labor market. Each household earns income and engages in consumption at each time step while aiming to maximize a concave utility subject to the underlying market conditions. The households aim to find the optimal saving strategy that maximizes their discounted cumulative utility given the market condition, while the firms determine the market conditions through maximizing corporate profit based on the household population behavior. The model captures a wide range of applications in macroeconomic studies, and we propose a data-driven reinforcement learning framework that finds the regularized competitive equilibrium of the model. The proposed algorithm enjoys theoretical guarantees in converging to the equilibrium of the market at a sub-linear rate., 44 pages
Published: 2023

8. Differentiable Arbitrating in Zero-sum Markov Games

Author: Wang, Jing, Song, Meichen, Gao, Feng, Liu, Boyi, Wang, Zhaoran, and Wu, Yi
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Multiagent Systems (cs.MA), Machine Learning (cs.LG)
Abstract: We initiate the study of how to perturb the reward in a zero-sum Markov game with two players to induce a desirable Nash equilibrium, namely arbitrating. Such a problem admits a bi-level optimization formulation. The lower level requires solving the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize in an end-to-end way. We propose a backpropagation scheme that differentiates through the Nash equilibrium, which provides the gradient feedback for the upper level. In particular, our method only requires a black-box solver for the (regularized) Nash equilibrium (NE). We develop the convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate the empirical successes in two multi-agent reinforcement learning (MARL) environments.
Published: 2023

9. Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Author: Xu, Haoran, Jiang, Li, Li, Jianxiong, Yang, Zhuoran, Wang, Zhaoran, Chan, Victor Wai Kin, and Zhan, Xianyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes., Comment: ICLR 2023 notable top 5%
Published: 2023
Full Text: View/download PDF

10. A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations

Author: Chen, Siyu, Wang, Yitan, Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the offline contextual bandit problem, where we aim to acquire an optimal policy using observational data. However, this data usually contains two deficiencies: (i) some variables that confound actions are not observed, and (ii) missing observations exist in the collected data. Unobserved confounders lead to a confounding bias and missing observations cause bias and inefficiency problems. To overcome these challenges and learn the optimal policy from the observed dataset, we present a new algorithm called Causal-Adjusted Pessimistic (CAP) policy learning, which forms the reward function as the solution of an integral equation system, builds a confidence set, and greedily takes action with pessimism. With mild assumptions on the data, we develop an upper bound to the suboptimality of CAP for the offline contextual bandit problem., Comment: 76 page, 5 figures
Published: 2023
Full Text: View/download PDF

11. Achieving Hierarchy-Free Approximation for Bilevel Programs With Equilibrium Constraints

Author: Li, Jiayang, Yu, Jing, Liu, Boyi, Wang, Zhaoran, and Nie, Yu Marco
Subjects: FOS: Computer and information sciences, Computer Science - Computer Science and Game Theory, Optimization and Control (math.OC), FOS: Mathematics, Computer Science - Multiagent Systems, Mathematics - Optimization and Control, Computer Science and Game Theory (cs.GT), Multiagent Systems (cs.MA)
Abstract: In this paper, we develop an approximation scheme for solving bilevel programs with equilibrium constraints, which are generally difficult to solve. Among other things, calculating the first-order derivative in such a problem requires differentiation across the hierarchy, which is computationally intensive, if not prohibitive. To bypass the hierarchy, we propose to bound such bilevel programs, equivalent to multiple-followers Stackelberg games, with two new hierarchy-free problems: a $T$-step Cournot game and a $T$-step monopoly model. Since they are standard equilibrium or optimization problems, both can be efficiently solved via first-order methods. Importantly, we show that the bounds provided by these problems -- the upper bound by the $T$-step Cournot game and the lower bound by the $T$-step monopoly model -- can be made arbitrarily tight by increasing the step parameter $T$ for a wide range of problems. We prove that a small $T$ usually suffices under appropriate conditions to reach an approximation acceptable for most practical purposes. Eventually, the analytical insights are highlighted through numerical examples.
Published: 2023
Full Text: View/download PDF

12. Wardrop Equilibrium Can Be Boundedly Rational: A New Behavioral Theory of Route Choice

Author: Li, Jiayang, Wang, Zhaoran, and Nie, Yu Marco
Subjects: FOS: Economics and business, FOS: Computer and information sciences, History, Polymers and Plastics, Computer Science - Computer Science and Game Theory, Economics - Theoretical Economics, Theoretical Economics (econ.TH), Business and International Management, Industrial and Manufacturing Engineering, Computer Science and Game Theory (cs.GT)
Abstract: As one of the most fundamental concepts in transportation science, Wardrop equilibrium (WE) has always had a relatively weak behavioral underpinning. To strengthen this foundation, one must reckon with bounded rationality in human decision-making processes, such as the lack of accurate information, limited computing power, and sub-optimal choices. This retreat from behavioral perfectionism in the literature, however, was typically accompanied by a conceptual modification of WE. Here we show that giving up perfect rationality need not force a departure from WE. On the contrary, WE can be reached with global stability in a routing game played by boundedly rational travelers. We achieve this result by developing a day-to-day (DTD) dynamical model that mimics how travelers gradually adjust their route valuations, hence choice probabilities, based on past experiences. Our model, called cumulative logit (CULO), resembles the classical DTD models but makes a crucial change: whereas the classical models assume routes are valued based on the cost averaged over historical data, ours values the routes based on the cost accumulated. To describe route choice behaviors, the CULO model only uses two parameters, one accounting for the rate at which the future route cost is discounted in the valuation relative to the past ones and the other describing the sensitivity of route choice probabilities to valuation differences. We prove that the CULO model always converges to WE, regardless of the initial point, as long as the behavioral parameters satisfy certain mild conditions. Our theory thus upholds WE's role as a benchmark in transportation systems analysis. It also resolves the theoretical challenge posed by Harsanyi's instability problem by explaining why equally good routes at WE are selected with different probabilities.
Published: 2023

13. An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Author: Zhang, Yufeng, Liu, Boyi, Cai, Qi, Wang, Lingxiao, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks. - To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences. - To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks., 85 pages, 7 figures
Published: 2022

14. Policy learning 'without' overlap: Pessimism and generalized empirical Bernstein's inequality

Author: Jin, Ying, Ren, Zhimei, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Methodology (stat.ME), FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Statistics - Methodology, Machine Learning (cs.LG)
Abstract: This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn the optimal individualized decision rule in a given class. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset. In other words, the performance of these methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed by quantifying the estimation uncertainty of the augmented inverse propensity weighted (AIPW)-type estimators using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized concentration inequality for IPW estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.
Published: 2022

15. Latent Variable Representation for Reinforcement Learning

Author: Ren, Tongzheng, Xiao, Chenjun, Zhang, Tianjun, Li, Na, Wang, Zhaoran, Sanghavi, Sujay, Schuurmans, Dale, and Dai, Bo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks., ICLR 2023. The first two authors contribute equally. Project Website: https://rlrep.github.io/lvrep/
Published: 2022

16. GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

Author: Zhong, Han, Xiong, Wei, Zheng, Sirui, Wang, Liwei, Wang, Zhaoran, Yang, Zhuoran, and Zhang, Tong
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. In specific, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR, where generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs and PSRs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values and (ii) a loglikelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL., We changed the title from the first version. We fixed a technical issue in the first version regarding the $\ell_2$ eluder technique (Lemma D.2)
Published: 2022

17. A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design

Author: Ai, Rui, Lyu, Boxiang, Wang, Zhaoran, Yang, Zhuoran, and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: We study reserve price optimization in multi-phase second price auctions, where seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, the setting in ours involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially nontruthful bidders who aim to manipulates seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is unknown, nonlinear, and cannot even be directly observed from the environment. We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named "buffer periods" and inspirations from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the $\underline{\rm C}$ontextual-$\underline{\rm L}$SVI-$\underline{\rm U}$CB-$\underline{\rm B}$uffer (CLUB) algorithm which achieves $\tilde{ \mathcal{O}}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{ \mathcal{O}}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown with no assumptions on bidders' truthfulness.
Published: 2022

18. Differentiable Bilevel Programming for Stackelberg Congestion Games

Author: Li, Jiayang, Yu, Jing, Wang, Qianni, Liu, Boyi, Wang, Zhaoran, and Nie, Yu Marco
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Computer Science and Game Theory, Computer Science - Artificial Intelligence, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Multiagent Systems, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Computer Science and Game Theory (cs.GT), Multiagent Systems (cs.MA)
Abstract: A Stackelberg congestion game (SCG) is a bilevel program in which a leader aims to maximize their own gain by anticipating and manipulating the equilibrium state at which followers settle by playing a congestion game. Large-scale SCGs are well known for their intractability and complexity. This study approaches SCGs through differentiable programming, which marries the latest developments in machine learning with conventional methodologies. The core idea centers on representing the lower-level equilibrium problem using an evolution path formed by the imitative logit dynamics. It enables the use of automatic differentiation over the evolution path towards equilibrium, leading to a double-loop gradient descent algorithm. We further show the fixation on the lower-level equilibrium may be a self-imposed computational obstacle. Instead, the leader may only look ahead along the followers' evolution path for a few steps, while updating their decisions in sync with the followers through a co-evolution process. The revelation gives rise to a single-loop algorithm that is more efficient in terms of both memory consumption and computation time. Through numerical experiments that cover a wide range of benchmark problems, we find the single-loop algorithm consistently strikes a good balance between solution quality and efficiency, outperforming not only the standard double-loop implementation but also other methods from the literature. Importantly, our results highlight both the wastefulness of "full anticipation" and the peril of "zero anticipation". If a quick-and-dirty heuristic is needed for solving a really large SCG, the proposed single-loop algorithm with a one-step look-ahead makes an ideal candidate.
Published: 2022

19. Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning

Author: Qiu, Shuang, Wang, Lingxiao, Bai, Chenjia, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In view of its power in extracting feature representation, contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL), leading to efficient policy learning in various applications. Despite its tremendous empirical successes, the understanding of contrastive learning for RL remains elusive. To narrow such a gap, we study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss. Moreover, under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs. We further theoretically prove that our algorithm recovers the true representations and simultaneously achieves sample efficiency in learning the optimal policy and Nash equilibrium in MDPs and MGs. We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. To the best of our knowledge, we provide the first provably efficient online RL algorithm that incorporates contrastive learning for representation learning. Our codes are available at https://github.com/Baichenjia/Contrastive-UCB., ICML 2022
Published: 2022

20. Federated Offline Reinforcement Learning

Author: Zhou, Doudou, Zhang, Yufeng, Sonabend-W, Aaron, Wang, Zhaoran, Lu, Junwei, and Cai, Tianxi
Subjects: Methodology (stat.ME), FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Statistics - Methodology, Machine Learning (cs.LG)
Abstract: Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, they are prohibited from sharing due to privacy constraints. Besides, heterogeneity exists in different sites. As a result, federated offline RL algorithms are necessary and promising to deal with the problems. In this paper, we propose a multi-site Markov decision process model which allows both homogeneous and heterogeneous effects across sites. The proposed model makes the analysis of the site-level features possible. We design the first federated policy optimization algorithm for offline RL with sample complexity. The proposed algorithm is communication-efficient and privacy-preserving, which requires only a single round of communication interaction by exchanging summary statistics. We give a theoretical guarantee for the proposed algorithm without the assumption of sufficient action coverage, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed. Extensive simulations demonstrate the effectiveness of the proposed algorithm. The method is applied to a sepsis data set in multiple sites to illustrate its use in clinical settings.
Published: 2022

21. RORL: Robust Offline Reinforcement Learning via Conservative Smoothing

Author: Yang, Rui, Bai, Chenjia, Ma, Xiaoteng, Wang, Zhaoran, Zhang, Chongjie, and Han, Lei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Offline reinforcement learning (RL) provides a promising direction to exploit massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when encountering observation deviation under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations., Accepted by Advances in Neural Information Processing Systems (NeurIPS) 2022
Published: 2022

22. Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Author: Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
Published: 2022

23. Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Author: Bai, Chenjia, Wang, Lingxiao, Yang, Zhuoran, Deng, Zhihong, Garg, Animesh, Liu, Peng, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms., ICLR 2022
Published: 2022

24. Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Author: Wu, Jibang, Zhang, Zixuan, Feng, Zhe, Wang, Zhaoran, Yang, Zhuoran, Jordan, Michael I., and Xu, Haifeng
Subjects: FOS: Computer and information sciences, FOS: Economics and business, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Economics - Theoretical Economics, Theoretical Economics (econ.TH), Computer Science and Game Theory (cs.GT), Machine Learning (cs.LG)
Abstract: In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers. This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), where a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximizes the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.
Published: 2022

25. Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Author: Chen, Xiaoyu, Zhong, Han, Yang, Zhuoran, Wang, Zhaoran, and Wang, Liwei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
Published: 2022
Full Text: View/download PDF

26. Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

Author: Qiu, Shuang, Wei, Xiaohan, Ye, Jieping, Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$., Comment: ICML 2021
Published: 2022
Full Text: View/download PDF

27. Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach

Author: Lyu, Boxiang, Meng, Qinglin, Qiu, Shuang, Wang, Zhaoran, Yang, Zhuoran, and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and attempt to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our work is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model., Comment: The first three authors contribute equally and are listed in alphabetical order
Published: 2022
Full Text: View/download PDF

28. Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Author: Fu, Zuyue, Qi, Zhengling, Yang, Zhuoran, Wang, Zhaoran, and Wang, Lan
Subjects: Methodology (stat.ME), FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Statistics - Methodology, Machine Learning (cs.LG)
Abstract: Motivated by the human-machine interaction such as training chatbots for improving customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based game, where one player (Alice, a human) guides the other player (Bob, a machine) towards a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline dataset collected a priori. The offline setting presents two challenges: (i) We cannot collect Bob's private information, leading to a confounding bias when using standard RL methods, and (ii) a distributional mismatch between the behavior policy used to collect data and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making so as to adjust for the unmeasured confounding. We develop a novel identification result and use it to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm for finding a desirable policy pair for both Alice and Bob. Finally, we prove that under mild assumptions such as partial coverage of the offline data, the policy pair obtained through our method converges to the optimal one at a satisfactory rate.
Published: 2022
Full Text: View/download PDF

29. Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Author: Wang, Lingxiao, Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Machine Learning (stat.ML), Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Machine Learning (cs.LG)
Abstract: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.~(i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
Published: 2022
Full Text: View/download PDF

30. Learn to Match with No Regret: Reinforcement Learning in Markov Matching Markets

Author: Min, Yifei, Wang, Tianhao, Xu, Ruitu, Wang, Zhaoran, Jordan, Michael I., and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, FOS: Mathematics, Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market. At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. Such a setting captures a range of applications including ridesharing platforms. We formalize the problem by proposing a reinforcement learning framework that integrates optimistic value iteration with maximum weight matching. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret., Comment: 40 pages
Published: 2022
Full Text: View/download PDF

31. Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Author: Lu, Miao, Min, Yifei, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Methodology (stat.ME), FOS: Computer and information sciences, Computer Science - Machine Learning, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Statistics - Methodology, Machine Learning (cs.LG)
Abstract: We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset., Comment: update after camera ready revision
Published: 2022
Full Text: View/download PDF

32. Offline Policy Optimization in RL with Variance Regularizaton

Author: Islam, Riashat, Sinha, Samarth, Bharadhwaj, Homanga, Arnob, Samin Yeasar, Yang, Zhuoran, Garg, Animesh, Wang, Zhaoran, Li, Lihong, and Precup, Doina
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms., Comment: Old Draft, Offline RL Workshop, NeurIPS'20
Published: 2022
Full Text: View/download PDF

33. Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning

Author: Lyu, Boxiang, Wang, Zhaoran, Kolar, Mladen, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: Dynamic mechanism design has garnered significant attention from both computer scientists and economists in recent years. By allowing agents to interact with the seller over multiple rounds, where agents' reward functions may change with time and are state-dependent, the framework is able to model a rich class of real-world problems. In these works, the interaction between agents and sellers is often assumed to follow a Markov Decision Process (MDP). We focus on the setting where the reward and transition functions of such an MDP are not known a priori, and we are attempting to recover the optimal mechanism using an a priori collected data set. In the setting where the function approximation is employed to handle large state spaces, with only mild assumptions on the expressiveness of the function class, we are able to design a dynamic mechanism using offline reinforcement learning algorithms. Moreover, learned mechanisms approximately have three key desiderata: efficiency, individual rationality, and truthfulness. Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set. To the best of our knowledge, our work provides the first offline RL algorithm for dynamic mechanism design without assuming uniform coverage., Comment: 52 pages
Published: 2022
Full Text: View/download PDF

34. Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL

Author: Zhang, Fengzhuo, Liu, Boyi, Wang, Kaixin, Tan, Vincent Y. F., Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Multiagent Systems, Machine Learning (stat.ML), Machine Learning (cs.LG), Multiagent Systems (cs.MA)
Abstract: The cooperative Multi-A gent R einforcement Learning (MARL) with permutation invariant agents framework has achieved tremendous empirical successes in real-world applications. Unfortunately, the theoretical understanding of this MARL problem is lacking due to the curse of many agents and the limited exploration of the relational reasoning in existing works. In this paper, we verify that the transformer implements complex relational reasoning, and we propose and analyze model-free and model-based offline MARL algorithms with the transformer approximators. We prove that the suboptimality gaps of the model-free and model-based algorithms are independent of and logarithmic in the number of agents respectively, which mitigates the curse of many agents. These results are consequences of a novel generalization error bound of the transformer and a novel analysis of the Maximum Likelihood Estimate (MLE) of the system dynamics with the transformer. Our model-based algorithm is the first provably efficient MARL algorithm that explicitly exploits the permutation invariance of the agents. Our improved generalization bound may be of independent interest and is applicable to other regression problems related to the transformer beyond MARL.
Published: 2022
Full Text: View/download PDF

35. Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes

Author: Fu, Zuyue, Qi, Zhengling, Wang, Zhaoran, Yang, Zhuoran, Xu, Yanxun, and Kosorok, Michael R.
Subjects: Methodology (stat.ME), FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Methodology, Machine Learning (cs.LG)
Abstract: We study the offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL is facing the following two significant challenges: (i) the agent may be confounded by the unobserved state variables; (ii) the offline data collected a prior does not provide sufficient coverage for the environment. To tackle the above challenges, we study the policy learning in the confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in the confounded MDPs. Then by leveraging pessimism and our identification results, we propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy under minimal data coverage and modeling assumptions. Lastly, our extensive theoretical investigations and one numerical study motivated by the kidney transplantation demonstrate the promising performance of the proposed methods.
Published: 2022
Full Text: View/download PDF

36. Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments

Author: Wang, Yixuan, Zhan, Simon Sinong, Jiao, Ruochen, Wang, Zhilu, Jin, Wanxin, Yang, Zhuoran, Wang, Zhaoran, Huang, Chao, and Zhu, Qi
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Machine Learning (cs.LG)
Abstract: It is quite challenging to ensure the safety of reinforcement learning (RL) agents in an unknown and stochastic environment under hard constraints that require the system state not to reach certain specified unsafe regions. Many popular safe RL methods such as those based on the Constrained Markov Decision Process (CMDP) paradigm formulate safety violations in a cost function and try to constrain the expectation of cumulative cost under a threshold. However, it is often difficult to effectively capture and enforce hard reachability-based safety constraints indirectly with such constraints on safety violation costs. In this work, we leverage the notion of barrier function to explicitly encode the hard safety constraints, and given that the environment is unknown, relax them to our design of \emph{generative-model-based soft barrier functions}. Based on such soft barriers, we propose a safe RL approach that can jointly learn the environment and optimize the control policy, while effectively avoiding unsafe regions with safety probability optimization. Experiments on a set of examples demonstrate that our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate measured via simulations., Comment: Accepted to ICML 2023
Published: 2022
Full Text: View/download PDF

37. Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

Author: Zhong, Han, Xiong, Wei, Tan, Jiyuan, Wang, Liwei, Zhang, Tong, Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
Published: 2022
Full Text: View/download PDF

38. Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning

Author: Fei, Yingjie, Yang, Zhuoran, Chen, Yudong, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure. Although existing works have established non-asymptotic regret guarantees for this problem, they leave open an exponential gap between the upper and lower bounds. We identify the deficiencies in existing algorithms and their analysis that result in such a gap. To remedy these deficiencies, we investigate a simple transformation of the risk-sensitive Bellman equations, which we call the exponential Bellman equation. The exponential Bellman equation inspires us to develop a novel analysis of Bellman backup procedures in risk-sensitive RL algorithms, and further motivates the design of a novel exploration mechanism. We show that these analytic and algorithmic innovations together lead to improved regret upper bounds over existing ones.
Published: 2021

39. Dynamic Bottleneck for Robust Self-Supervised Exploration

Author: Bai, Chenjia, Wang, Lingxiao, Han, Lei, Garg, Animesh, Hao, Jianye, Liu, Peng, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: Exploration methods based on pseudo-count of transitions or curiosity of dynamics have achieved promising results in solving reinforcement learning with sparse rewards. However, such methods are usually sensitive to environmental dynamics-irrelevant information, e.g., white-noise. To handle such dynamics-irrelevant information, we propose a Dynamic Bottleneck (DB) model, which attains a dynamics-relevant representation based on the information-bottleneck principle. Based on the DB model, we further propose DB-bonus, which encourages the agent to explore state-action pairs with high information gain. We establish theoretical connections between the proposed DB-bonus, the upper confidence bound (UCB) for linear case, and the visiting count for tabular case. We evaluate the proposed method on Atari suits with dynamics-irrelevant noises. Our experiments show that exploration with DB bonus outperforms several state-of-the-art exploration methods in noisy environments., NeurIPS 2021
Published: 2021

40. On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

Author: Qiu, Shuang, Ye, Jieping, Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG), Computer Science and Game Theory (cs.GT)
Abstract: To achieve sample efficiency in reinforcement learning (RL), it necessitates efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators., ICML 2021
Published: 2021

41. Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Author: Zhong, Han, Yang, Zhuoran, Wang, Zhaoran, and Szepesvári, Csaba
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement, which are tailored for policy optimization in a non-stationary environment. In addition, only utilizing the technique of sliding window, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a matching minimax lower bound which shows the (near-) optimality of the proposed methods. To our best knowledge, PROPO is the first provably efficient policy optimization algorithm that handles non-stationarity.
Published: 2021

42. Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation

Author: Liu, Zhihan, Zhang, Yufeng, Fu, Zuyue, Yang, Zhuoran, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain predefined reward set. In this paper, we study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps. Besides the expert demonstration, in the online setting the agent can interact with the environment, while in the offline setting the agent only accesses an additional dataset collected by a prior. For online GAIL, we propose an optimistic generative adversarial policy optimization algorithm (OGAP) and prove that OGAP achieves $\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here $N_1$ represents the number of trajectories of the expert demonstration, $d$ is the feature dimension, and $K$ is the number of episodes. For offline GAIL, we propose a pessimistic generative adversarial policy optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain the optimality gap of PGAP, achieving the minimax lower bound in the utilization of the additional dataset. Assuming sufficient coverage on the additional dataset, we show that PGAP achieves $\widetilde{\mathcal{O}}(H^{2}dK^{-1/2} +H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2} \ )$ optimality gap. Here $N_2$ represents the number of trajectories of the additional dataset with sufficient coverage., 54 pages, in submission
Published: 2021

43. Towards General Function Approximation in Zero-Sum Markov Games

Author: Huang, Baihe, Lee, Jason D., Wang, Zhaoran, and Yang, Zhuoran
Subjects: FOS: Computer and information sciences, Computer Science::Computer Science and Game Theory, Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Computer Science and Game Theory (cs.GT), Machine Learning (cs.LG)
Abstract: This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms for both decoupled and {coordinated} settings are developed. In the {decoupled} setting where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. The sample complexity is governed by the Minimax Eluder dimension -- a new dimension of the function class in Markov games. As a special case, this method improves the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the {coordinated} setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. In the model-based algorithm, we prove that sample complexity can be bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound where $K$ is the number of episodes.
Published: 2021

44. A Unified Off-Policy Evaluation Approach for General Value Function

Author: Xu, Tengyu, Yang, Zhuoran, Wang, Zhaoran, and Liang, Yingbin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, FOS: Mathematics, Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: General Value Function (GVF) is a powerful tool to represent both the {\em predictive} and {\em retrospective} knowledge in reinforcement learning (RL). In practice, often multiple interrelated GVFs need to be evaluated jointly with pre-collected off-policy samples. In the literature, the gradient temporal difference (GTD) learning method has been adopted to evaluate GVFs in the off-policy setting, but such an approach may suffer from a large estimation error even if the function approximation class is sufficiently expressive. Moreover, none of the previous work have formally established the convergence guarantee to the ground truth GVFs under the function approximation settings. In this paper, we address both issues through the lens of a class of GVFs with causal filtering, which cover a wide range of RL applications such as reward variance, value gradient, cost in anomaly detection, stationary distribution gradient, etc. We propose a new algorithm called GenTD for off-policy GVFs evaluation and show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function. We further show that unlike GTD, the learned GVFs by GenTD are guaranteed to converge to the ground truth GVFs as long as the function approximation power is sufficiently large. To our best knowledge, GenTD is the first off-policy GVF evaluation algorithm that has global optimality guarantee., submitted for publication
Published: 2021

45. Gap-Dependent Bounds for Two-Player Markov Games

Author: Dou, Zehao, Yang, Zhuoran, Wang, Zhaoran, and Du, Simon S.
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science::Computer Science and Game Theory, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bound of algorithms that belong to the Q-learning class in different settings. In this paper, we analyze the cumulative regret when conducting Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound only up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret for 2-TBSG, in both centralized and independent settings., 34 pages
Published: 2021

46. Verification in the Loop: Correct-by-Construction Control Learning with Reach-avoid Guarantees

Author: Wang, Yixuan, Huang, Chao, Wang, Zhaoran, Wang, Zhilu, and Zhu, Qi
Subjects: FOS: Computer and information sciences, Formal Languages and Automata Theory (cs.FL), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Formal Languages and Automata Theory, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control
Abstract: In the current control design of safety-critical autonomous systems, formal verification techniques are typically applied after the controller is designed to evaluate whether the required properties (e.g., safety) are satisfied. However, due to the increasing system complexity and the fundamental hardness of designing a controller with formal guarantees, such an open-loop process of design-then-verify often results in many iterations and fails to provide the necessary guarantees. In this paper, we propose a correct-by-construction control learning framework that integrates the verification into the control design process in a closed-loop manner, i.e., design-while-verify. Specifically, we leverage the verification results (computed reachable set of the system state) to construct feedback metrics for control learning, which measure how likely the current design of control parameters can meet the required reach-avoid property for safety and goal-reaching. We formulate an optimization problem based on such metrics for tuning the controller parameters, and develop an approximated gradient descent algorithm with a difference method to solve the optimization problem and learn the controller. The learned controller is formally guaranteed to meet the required reach-avoid property. By treating verifiability as a first-class objective and effectively leveraging the verification results during the control learning process, our approach can significantly improve the chance of finding a control design with formal property guarantees. This is demonstrated via a set of experiments on both linear and non-linear systems that use model-based or neural network based controllers., Under review
Published: 2021

47. Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning

Author: Liao, Luofeng, Fu, Zuyue, Yang, Zhuoran, Wang, Yixin, Kolar, Mladen, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In offline reinforcement learning (RL) an optimal policy is learnt solely from a priori collected observational data. However, in observational data, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are the variables whose influence on the state variables are all mediated through the action. When a valid instrument is present, we can recover the confounded transition dynamics through observational data. We study a confounded Markov decision process where the transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction (CMR) through which we can identify transition dynamics based on observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of CMR. To the best of our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL., under review
Published: 2021

48. ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning

Author: Liu, Xiao-Yang, Li, Zechu, Yang, Zhuoran, Zheng, Jiahao, Wang, Zhaoran, Walid, Anwar, Guo, Jian, and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: Deep reinforcement learning (DRL) has revolutionized learning and actuation in applications such as game playing and robotic control. The cost of data collection, i.e., generating transitions from agent-environment interactions, remains a major challenge for wider DRL adoption in complex real-world problems. Following a cloud-native paradigm to train DRL agents on a GPU cloud platform is a promising solution. In this paper, we present a scalable and elastic library ElegantRL-podracer for cloud-native deep reinforcement learning, which efficiently supports millions of GPU cores to carry out massively parallel training at multiple levels. At a high-level, ElegantRL-podracer employs a tournament-based ensemble scheme to orchestrate the training process on hundreds or even thousands of GPUs, scheduling the interactions between a leaderboard and a training pool with hundreds of pods. At a low-level, each pod simulates agent-environment interactions in parallel by fully utilizing nearly 7,000 GPU CUDA cores in a single GPU. Our ElegantRL-podracer library features high scalability, elasticity and accessibility by following the development principles of containerization, microservices and MLOps. Using an NVIDIA DGX SuperPOD cloud, we conduct extensive experiments on various tasks in locomotion and stock trading and show that ElegantRL-podracer substantially outperforms RLlib. Our codes are available on GitHub., Comment: 9 pages, 7 figures
Published: 2021
Full Text: View/download PDF

49. FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance

Author: Liu, Xiao-Yang, Rui, Jingyang, Gao, Jiechao, Yang, Liuqing, Yang, Hongyang, Wang, Zhaoran, Wang, Christina Dan, and Guo, Jian
Subjects: FOS: Economics and business, FOS: Computer and information sciences, Computer Science - Machine Learning, Quantitative Finance - Trading and Market Microstructure, Trading and Market Microstructure (q-fin.TR), Machine Learning (cs.LG)
Abstract: Deep reinforcement learning (DRL) has shown huge potentials in building financial market simulators recently. However, due to the highly complex and dynamic nature of real-world markets, raw historical financial data often involve large noise and may not reflect the future of markets, degrading the fidelity of DRL-based market simulators. Moreover, the accuracy of DRL-based market simulators heavily relies on numerous and diverse DRL agents, which increases demand for a universe of market environments and imposes a challenge on simulation speed. In this paper, we present a FinRL-Meta framework that builds a universe of market environments for data-driven financial reinforcement learning. First, FinRL-Meta separates financial data processing from the design pipeline of DRL-based strategy and provides open-source data engineering tools for financial big data. Second, FinRL-Meta provides hundreds of market environments for various trading tasks. Third, FinRL-Meta enables multiprocessing simulation and training by exploiting thousands of GPU cores. Our codes are available online at https://github.com/AI4Finance-Foundation/FinRL-Meta., Comment: Workshop on Data Centric AI, 35th Conference on Neural Information Processing Systems (NeurIPS 2021)
Published: 2021
Full Text: View/download PDF

50. Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

Author: Xu, Tengyu, Yang, Zhuoran, Wang, Zhaoran, and Liang, Yingbin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Designing off-policy reinforcement learning algorithms is typically a very challenging task, because a desirable iteration update often involves an expectation over an on-policy distribution. Prior off-policy actor-critic (AC) algorithms have introduced a new critic that uses the density ratio for adjusting the distribution mismatch in order to stabilize the convergence, but at the cost of potentially introducing high biases due to the estimation errors of both the density ratio and value function. In this paper, we develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP, which can take advantage of learned nuisance functions to reduce estimation errors. Moreover, DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize, and is thus more sample efficient than prior algorithms that adopt either two timescale or nested-loop structure. We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy. We also show that the overall convergence of DR-Off-PAC is doubly robust to the approximation errors that depend only on the expressive power of approximation functions. To the best of our knowledge, our study establishes the first overall sample complexity analysis for a single time-scale off-policy AC algorithm., Comment: Published in ICML 2021
Published: 2021
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

105 results on '"Wang, Zhaoran"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources