56 results on '"Chen, Sitan"'
Search Results
2. Critical windows: non-asymptotic theory for feature emergence in diffusion models
- Author
Li, Marvin and Chen, Sitan
- Abstract
We develop theory to understand an intriguing property of diffusion models for image generation that we term critical windows. Empirically, it has been observed that there are narrow time intervals in sampling during which particular features of the final image emerge, e.g. the image class or background color (Ho et al., 2020b; Meng et al., 2022; Choi et al., 2022; Raya & Ambrogioni, 2023; Georgiev et al., 2023; Sclocchi et al., 2024; Biroli et al., 2024). While this is advantageous for interpretability as it implies one can localize properties of the generation to a small segment of the trajectory, it seems at odds with the continuous nature of the diffusion. We propose a formal framework for studying these windows and show that for data coming from a mixture of strongly log-concave densities, these windows can be provably bounded in terms of certain measures of inter- and intra-group separation. We also instantiate these bounds for concrete examples like well-conditioned Gaussian mixtures. Finally, we use our bounds to give a rigorous interpretation of diffusion models as hierarchical samplers that progressively "decide" output features over a discrete sequence of times. We validate our bounds with synthetic experiments. Additionally, preliminary experiments on Stable Diffusion suggest critical windows may serve as a useful tool for diagnosing fairness and privacy violations in real-world diffusion models.
- Published
- 2024
3. An optimal tradeoff between entanglement and copy complexity for state tomography
- Author
Chen, Sitan, Li, Jerry, and Liu, Allen
- Abstract
There has been significant interest in understanding how practical constraints on contemporary quantum devices impact the complexity of quantum learning. For the classic question of tomography, recent work tightly characterized the copy complexity for any protocol that can only measure one copy of the unknown state at a time, showing it is polynomially worse than if one can make fully-entangled measurements. While we now have a fairly complete picture of the rates for such tasks in the near-term and fault-tolerant regimes, it remains poorly understood what the landscape in between looks like. In this work, we study tomography in the natural setting where one can make measurements of $t$ copies at a time. For sufficiently small $\epsilon$, we show that for any $t \le d^2$, $\widetilde{\Theta}(\frac{d^3}{\sqrt{t}\epsilon^2})$ copies are necessary and sufficient to learn an unknown $d$-dimensional state $\rho$ to trace distance $\epsilon$. This gives a smooth and optimal interpolation between the known rates for single-copy and fully-entangled measurements. To our knowledge, this is the first smooth entanglement-copy tradeoff known for any quantum learning task, and for tomography, no intermediate point on this curve was known, even at $t = 2$. An important obstacle is that unlike the optimal single-copy protocol, the optimal fully-entangled protocol is inherently biased and thus precludes naive batching approaches. Instead, we devise a novel two-stage procedure that uses Keyl's algorithm to refine a crude estimate for $\rho$ based on single-copy measurements. A key insight is to use Schur-Weyl sampling not to estimate the spectrum of $\rho$, but to estimate the deviation of $\rho$ from the maximally mixed state. When $\rho$ is far from the maximally mixed state, we devise a novel quantum splitting procedure that reduces to the case where $\rho$ is close to maximally mixed., Comment: To appear at STOC 2024. Abstract shortened to meet arXiv requirement. 
36 pages, comments welcome
- Published
- 2024
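As a sanity check on the rate quoted in the abstract above, the endpoints of the $t$-copy bound can be compared with the single-copy and fully-entangled rates. This is only a sketch that ignores the logarithmic factors hidden in $\widetilde{\Theta}$; the name $N(t)$ is introduced here for illustration and is not notation from the paper.

```latex
% N(t): copies needed when measuring t copies at a time (log factors dropped)
\[
  N(t) \approx \frac{d^3}{\sqrt{t}\,\epsilon^2}, \qquad 1 \le t \le d^2 .
\]
% Single-copy endpoint, t = 1:
\[
  N(1) \approx \frac{d^3}{\epsilon^2}
\]
% Fully-entangled endpoint, t = d^2, where \sqrt{t} = d:
\[
  N(d^2) \approx \frac{d^3}{d\,\epsilon^2} = \frac{d^2}{\epsilon^2}
\]
```

The two endpoints agree with the $d^3/\epsilon^2$ single-copy and $d^2/\epsilon^2$ fully-entangled rates for trace-distance tomography, which is the interpolation the abstract describes.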
4. Provably learning a multi-head attention layer
- Author
Chen, Sitan and Li, Yuanzhi
- Abstract
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem: - Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$. - We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable. We focus on Boolean $\mathbf{X}$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feedforward networks, which predominantly exploit algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof., Comment: 105 pages, comments welcome
- Published
- 2024
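The map $F(\mathbf{X}) = \sum_{i=1}^m \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$ defined in the abstract above can be evaluated directly. The following is a minimal pure-Python sketch of that forward map only, not the paper's learning algorithm. It assumes the softmax is applied row-wise (the usual convention, not stated in the abstract), and the helper names `softmax_rows`, `matmul`, `transpose`, and `multi_head_attention` are hypothetical.

```python
import math


def softmax_rows(M):
    """Row-wise softmax of a matrix stored as a list of lists (assumed convention)."""
    out = []
    for row in M:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def multi_head_attention(X, thetas, Ws):
    """Evaluate F(X) = sum_i softmax(X Theta_i X^T) X W_i.

    X is k x d; each Theta_i and W_i is d x d; the result is k x d.
    """
    k, d = len(X), len(X[0])
    Xt = transpose(X)
    out = [[0.0] * d for _ in range(k)]
    for Theta, W in zip(thetas, Ws):
        scores = matmul(matmul(X, Theta), Xt)   # k x k attention scores
        attn = softmax_rows(scores)             # each row sums to 1
        head = matmul(matmul(attn, X), W)       # k x d output of this head
        for i in range(k):
            for j in range(d):
                out[i][j] += head[i][j]
    return out
```

With a zero attention matrix the softmax is uniform, so a single head with an identity projection returns the average token in every row. This gives an easy way to check the sketch on a Boolean input from $\{\pm 1\}^{k\times d}$.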
5. Faster Diffusion-based Sampling with Randomized Midpoints: Sequential and Parallel
- Author
Gupta, Shivam, Cai, Linda, and Chen, Sitan
- Abstract
In recent years, there has been a surge of interest in proving discretization bounds for diffusion models. These works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. In this work, we propose a new discretization scheme for diffusion models inspired by Shen and Lee's randomized midpoint method for log-concave sampling~\cite{ShenL19}. We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\widetilde O(\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models. As a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work.
- Published
- 2024
6. Optimal tradeoffs for estimating Pauli observables
- Author
Chen, Sitan, Gong, Weiyuan, and Ye, Qi
- Abstract
We revisit the problem of Pauli shadow tomography: given copies of an unknown $n$-qubit quantum state $\rho$, estimate $\text{tr}(P\rho)$ for some set of Pauli operators $P$ to within additive error $\epsilon$. This has been a popular testbed for exploring the advantage of protocols with quantum memory over those without: with enough memory to measure two copies at a time, one can use Bell sampling to estimate $|\text{tr}(P\rho)|$ for all $P$ using $O(n/\epsilon^4)$ copies, but with $k\le n$ qubits of memory, $\Omega(2^{(n-k)/3})$ copies are needed. These results leave open several natural questions. How does this picture change in the physically relevant setting where one only needs to estimate a certain subset of Paulis? What is the optimal dependence on $\epsilon$? What is the optimal tradeoff between quantum memory and sample complexity? We answer all of these questions. For any subset $A$ of Paulis and any family of measurement strategies, we completely characterize the optimal sample complexity, up to $\log |A|$ factors. We show any protocol that makes $\text{poly}(n)$-copy measurements must make $\Omega(1/\epsilon^4)$ measurements. For any protocol that makes $\text{poly}(n)$-copy measurements and only has $k < n$ qubits of memory, we show that $\widetilde{\Theta}(\min\{2^n/\epsilon^2, 2^{n-k}/\epsilon^4\})$ copies are necessary and sufficient. The protocols we propose can also estimate the actual values $\text{tr}(P\rho)$, rather than just their absolute values as in prior work. Additionally, as a byproduct of our techniques, we establish tight bounds for the task of purity testing and show that it exhibits an intriguing phase transition not present in the memory-sample tradeoff for Pauli shadow tomography., Comment: 59 pages, 1 figure
- Published
- 2024
7. Learning general Gaussian mixtures with efficient score matching
- Author
Chen, Sitan, Kontonis, Vasilis, and Shah, Kulin
- Abstract
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\varepsilon$-far from the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task., Comment: 57 pages
- Published
- 2024
8. Restoration-Degradation Beyond Linear Diffusions: A Non-Asymptotic Analysis For DDIM-Type Samplers
- Author
Chen, Sitan, Daras, Giannis, and Dimakis, Alexandros G.
- Abstract
We develop a framework for non-asymptotic analysis of deterministic samplers used for diffusion generative modeling. Several recent works have analyzed stochastic samplers using tools like Girsanov's theorem and a chain rule variant of the interpolation argument. Unfortunately, these techniques give vacuous bounds when applied to deterministic samplers. We give a new operational interpretation for deterministic sampling by showing that one step along the probability flow ODE can be expressed as two steps: 1) a restoration step that runs gradient ascent on the conditional log-likelihood at some infinitesimally previous time, and 2) a degradation step that runs the forward process using noise pointing back towards the current iterate. This perspective allows us to extend denoising diffusion implicit models to general, non-linear forward processes. We then develop the first polynomial convergence bounds for these samplers under mild conditions on the data distribution., Comment: 29 pages
- Published
- 2023
9. The probability flow ODE is provably fast
- Author
Chen, Sitan, Chewi, Sinho, Lee, Holden, Li, Yuanzhi, Lu, Jianfeng, and Salim, Adil
- Abstract
We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework., Comment: 23 pages, 2 figures
- Published
- 2023
10. Learning Narrow One-Hidden-Layer ReLU Networks
- Author
Chen, Sitan, Dou, Zehao, Goel, Surbhi, Klivans, Adam R, and Meka, Raghu
- Abstract
We consider the well-studied problem of learning a linear combination of $k$ ReLU activations with respect to a Gaussian distribution on inputs in $d$ dimensions. We give the first polynomial-time algorithm that succeeds whenever $k$ is a constant. All prior polynomial-time learners require additional assumptions on the network, such as positive combining coefficients or the matrix of hidden weight vectors being well-conditioned. Our approach is based on analyzing random contractions of higher-order moment tensors. We use a multi-scale analysis to argue that sufficiently close neurons can be collapsed together, sidestepping the conditioning issues present in prior work. This allows us to design an iterative procedure to discover individual neurons., Comment: 33 pages, comments welcome
- Published
- 2023
11. Efficient Pauli channel estimation with logarithmic quantum memory
- Author
Chen, Sitan and Gong, Weiyuan
- Abstract
Here we revisit one of the prototypical tasks for characterizing the structure of noise in quantum devices: estimating every eigenvalue of an $n$-qubit Pauli noise channel to error $\epsilon$. Prior work (Chen et al., 2022) proved no-go theorems for this task in the practical regime where one has a limited amount of quantum memory, e.g. any protocol with $\le 0.99n$ ancilla qubits of quantum memory must make exponentially many measurements, provided it is non-concatenating. Such protocols can only interact with the channel by repeatedly preparing a state, passing it through the channel, and measuring immediately afterward. This left open a natural question: does the lower bound hold even for general protocols, i.e. ones which chain together many queries to the channel, interleaved with arbitrary data-processing channels, before measuring? Surprisingly, in this work we show the opposite: there is a protocol that can estimate the eigenvalues of a Pauli channel to error $\epsilon$ using only $O(\log n/\epsilon^2)$ ancilla qubits and $\tilde{O}(n^2/\epsilon^2)$ measurements. In contrast, we show that any protocol with zero ancilla, even a concatenating one, must make $\Omega(2^n/\epsilon^2)$ measurements, which is tight. Our results imply, to our knowledge, the first quantum learning task where logarithmically many qubits of quantum memory suffice for an exponential statistical advantage., Comment: 57 pages, 3 figures
- Published
- 2023
12. A faster and simpler algorithm for learning shallow networks
- Author
Chen, Sitan and Narayanan, Shyam
- Abstract
We revisit the well-studied problem of learning a linear combination of $k$ ReLU activations given labeled examples drawn from the standard $d$-dimensional Gaussian measure. Chen et al. [CDG+23] recently gave the first algorithm for this problem to run in $\text{poly}(d,1/\varepsilon)$ time when $k = O(1)$, where $\varepsilon$ is the target error. More precisely, their algorithm runs in time $(d/\varepsilon)^{\mathrm{quasipoly}(k)}$ and learns over multiple stages. Here we show that a much simpler one-stage version of their algorithm suffices, and moreover its runtime is only $(d/\varepsilon)^{O(k^2)}$., Comment: 14 pages
- Published
- 2023
13. Learning Mixtures of Gaussians Using the DDPM Objective
- Author
Shah, Kulin, Chen, Sitan, and Klivans, Adam
- Abstract
Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results along these lines for one of the most fundamental distribution families, Gaussian mixture models. We prove that gradient descent on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model in the following two settings: 1) We show gradient descent with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\text{poly}(d)$-separated centers. 2) We show gradient descent with a warm start learns mixtures of $K$ spherical Gaussians with $\Omega(\sqrt{\log(\min(K,d))})$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, the EM algorithm and spectral methods., Comment: 48 pages
- Published
- 2023
14. Beyond the low-degree algorithm: mixtures of subcubes and their applications
- Author
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Department of Mathematics, Chen, Sitan, and Moitra, Ankur
- Published
- 2022
15. Symmetric Sparse Boolean Matrix Factorization and Applications
- Author
Chen, Sitan, Song, Zhao, Tao, Runzhou, and Zhang, Ruizhe
- Published
- 2022
17. Learning to predict arbitrary quantum processes
- Author
Huang, Hsin-Yuan, Chen, Sitan, and Preskill, John
- Abstract
We present an efficient machine learning (ML) algorithm for predicting any unknown quantum process $\mathcal{E}$ over $n$ qubits. For a wide range of distributions $\mathcal{D}$ on arbitrary $n$-qubit states, we show that this ML algorithm can learn to predict any local property of the output from the unknown process~$\mathcal{E}$, with a small average error over input states drawn from $\mathcal{D}$. The ML algorithm is computationally efficient even when the unknown process is a quantum circuit with exponentially many gates. Our algorithm combines efficient procedures for learning properties of an unknown state and for learning a low-degree approximation to an unknown observable. The analysis hinges on proving new norm inequalities, including a quantum analogue of the classical Bohnenblust-Hille inequality, which we derive by giving an improved algorithm for optimizing local Hamiltonians. Numerical experiments on predicting quantum dynamics with evolution time up to $10^6$ and system size up to $50$ qubits corroborate our proof. Overall, our results highlight the potential for ML models to predict the output of complex quantum dynamics much faster than the time needed to run the process itself., Comment: 16 pages, 5 figure + 36-page appendix; v3: Added numerical experiments; open source code available at https://github.com/hsinyuan-huang/learning-quantum-process
- Published
- 2022
18. The Complexity of NISQ
- Author
Chen, Sitan, Cotler, Jordan, Huang, Hsin-Yuan, and Li, Jerry
- Abstract
The recent proliferation of NISQ devices has made it imperative to understand their computational power. In this work, we define and study the complexity class $\textsf{NISQ} $, which is intended to encapsulate problems that can be efficiently solved by a classical computer with access to a NISQ device. To model existing devices, we assume the device can (1) noisily initialize all qubits, (2) apply many noisy quantum gates, and (3) perform a noisy measurement on all qubits. We first give evidence that $\textsf{BPP}\subsetneq \textsf{NISQ}\subsetneq \textsf{BQP}$, by demonstrating super-polynomial oracle separations among the three classes, based on modifications of Simon's problem. We then consider the power of $\textsf{NISQ}$ for three well-studied problems. For unstructured search, we prove that $\textsf{NISQ}$ cannot achieve a Grover-like quadratic speedup over $\textsf{BPP}$. For the Bernstein-Vazirani problem, we show that $\textsf{NISQ}$ only needs a number of queries logarithmic in what is required for $\textsf{BPP}$. Finally, for a quantum state learning problem, we prove that $\textsf{NISQ}$ is exponentially weaker than classical computation with access to noiseless constant-depth quantum circuits., Comment: 15+37 pages, 3 figures
- Published
- 2022
19. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions
- Author
Chen, Sitan, Chewi, Sinho, Li, Jerry, Li, Yuanzhi, Salim, Adil, and Zhang, Anru R.
- Abstract
We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs., Comment: 29 pages
- Published
- 2022
20. Tight Bounds for Quantum State Certification with Incoherent Measurements
- Author
Chen, Sitan, Huang, Brice, Li, Jerry, and Liu, Allen
- Abstract
We consider the problem of quantum state certification, where we are given the description of a mixed state $\sigma \in \mathbb{C}^{d \times d}$, $n$ copies of a mixed state $\rho \in \mathbb{C}^{d \times d}$, and $\varepsilon > 0$, and we are asked to determine whether $\rho = \sigma$ or whether $\| \rho - \sigma \|_1 > \varepsilon$. When $\sigma$ is the maximally mixed state $\frac{1}{d} I_d$, this is known as mixedness testing. We focus on algorithms which use incoherent measurements, i.e. which only measure one copy of $\rho$ at a time. Unlike those that use entangled, multi-copy measurements, these can be implemented without persistent quantum memory and thus represent a large class of protocols that can be run on current or near-term devices. For mixedness testing, there is a folklore algorithm which uses incoherent measurements and only needs $O(d^{3/2} / \varepsilon^2)$ copies. The algorithm is non-adaptive, that is, its measurements are fixed ahead of time, and is known to be optimal for non-adaptive algorithms. However, when the algorithm can make arbitrary incoherent measurements, the best known lower bound is only $\Omega (d^{4/3} / \varepsilon^2)$ [Bubeck-Chen-Li '20], and it has been an outstanding open problem to close this polynomial gap. In this work, 1) we settle the copy complexity of mixedness testing with incoherent measurements and show that $\Omega (d^{3/2} / \varepsilon^2)$ copies are necessary, and 2) we show the instance-optimal bounds for state certification to general $\sigma$ first derived by [Chen-Li-O'Donnell '21] for non-adaptive measurements also hold for arbitrary incoherent measurements. Qualitatively, our results say that adaptivity does not help at all for these problems. Our results are based on new techniques that allow us to reduce the problem to understanding certain matrix martingales, which we believe may be of independent interest., Comment: 55 pages, comments welcome; v2: bug fix for Claims 6.11 and 7.13
- Published
- 2022
21. Learning Polynomial Transformations
- Author
Chen, Sitan, Li, Jerry, Li, Yuanzhi, and Zhang, Anru R.
- Abstract
We consider the problem of learning high dimensional polynomial transformations of Gaussians. Given samples of the form $p(x)$, where $x\sim N(0, \mathrm{Id}_r)$ is hidden and $p: \mathbb{R}^r \to \mathbb{R}^d$ is a function where every output coordinate is a low-degree polynomial, the goal is to learn the distribution over $p(x)$. This problem is natural in its own right, but is also an important special case of learning deep generative models, namely pushforwards of Gaussians under two-layer neural networks with polynomial activations. Understanding the learnability of such generative models is crucial to understanding why they perform so well in practice. Our first main result is a polynomial-time algorithm for learning quadratic transformations of Gaussians in a smoothed setting. Our second main result is a polynomial-time algorithm for learning constant-degree polynomial transformations of Gaussian in a smoothed setting, when the rank of the associated tensors is small. In fact our results extend to any rotation-invariant input distribution, not just Gaussian. These are the first end-to-end guarantees for learning a pushforward under a neural network with more than one layer. Along the way, we also give the first polynomial-time algorithms with provable guarantees for tensor ring decomposition, a popular generalization of tensor decomposition that is used in practice to implicitly store large tensors., Comment: 121 pages, comments welcome
- Published
- 2022
22. Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks
- Author
Chen, Sitan, Gollakota, Aravind, Klivans, Adam R., and Meka, Raghu
- Abstract
We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries., Comment: 35 pages, v3: refined exposition
- Published
- 2022
23. Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs
- Author
Chen, Sitan, Li, Jerry, Li, Yuanzhi, and Meka, Raghu
- Abstract
Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution. Theoretical and empirical evidence suggests local optimality of the empirical training objective is insufficient. Yet, it does not rule out the possibility that achieving a true population minimax optimal solution might imply distribution learning. In this paper, we show that standard cryptographic assumptions imply that this stronger condition is still insufficient. Namely, we show that if local pseudorandom generators (PRGs) exist, then for a large family of natural continuous target distributions, there are ReLU network generators of constant depth and polynomial size which take Gaussian random seeds so that (i) the output is far in Wasserstein distance from the target distribution, but (ii) no polynomially large Lipschitz discriminator ReLU network can detect this. This implies that even achieving a population minimax optimal solution to the Wasserstein GAN objective is likely insufficient for distribution learning in the usual statistical sense. Our techniques reveal a deep connection between GANs and PRGs, which we believe will lead to further insights into the computational landscape of GANs., Comment: 32 pages, 1 figure
- Published
- 2022
24. When Does Adaptivity Help for Quantum State Learning?
- Author
Chen, Sitan, Huang, Brice, Li, Jerry, Liu, Allen, and Sellke, Mark
- Abstract
We consider the classic question of state tomography: given copies of an unknown quantum state $\rho\in\mathbb{C}^{d\times d}$, output $\widehat{\rho}$ which is close to $\rho$ in some sense, e.g. trace distance or fidelity. When one is allowed to make coherent measurements entangled across all copies, $\Theta(d^2/\epsilon^2)$ copies are necessary and sufficient to get trace distance $\epsilon$. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O(d^3/\epsilon^2)$ copies, and multiple papers have posed it as an open question to understand whether or not this rate is tight. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $\Omega(d^3/\epsilon^2)$ copies, matching the best known upper bound. We do so by a new proof technique which directly bounds the ``tilt'' of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest. While this implies that adaptivity does not help for tomography with respect to trace distance, we show that it actually does help for tomography with respect to infidelity. We give an adaptive algorithm that outputs a state which is $\gamma$-close in infidelity to $\rho$ using only $\tilde{O}(d^3/\gamma)$ copies, which is optimal for incoherent measurements. In contrast, it is known that any nonadaptive algorithm requires $\Omega(d^3/\gamma^2)$ copies. While it is folklore that in $2$ dimensions, one can achieve a scaling of $O(1/\gamma)$, to the best of our knowledge, our algorithm is the first to achieve the optimal rate in all dimensions., Comment: 22 pages
- Published
- 2022
25. Learning (Very) Simple Generative Models Is Hard
- Author
Chen, Sitan, Li, Jerry, and Li, Yuanzhi
- Abstract
Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F:\mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0,\textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output any distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and $\mathrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$., Comment: 24 pages, 2 figures
- Published
- 2022
26. Efficiently learning structured distributions from untrusted batches
- Author
Massachusetts Institute of Technology. Media Laboratory, Chen, Sitan, Li, Jerry, and Moitra, Ankur
- Abstract
© 2020 ACM. We study the problem, introduced by Qiao and Valiant, of learning from untrusted batches. Here, we assume m users, all of whom have samples from some underlying distribution over 1, ..., n. Each user sends a batch of k i.i.d. samples from this distribution; however, an ε-fraction of users are untrustworthy and can send adversarially chosen responses. The goal of the algorithm is to learn the underlying distribution in total variation distance. When k = 1 this is the standard robust univariate density estimation setting and it is well-understood that Ω(ε) error is unavoidable. Surprisingly, Qiao and Valiant gave an estimator which improves upon this rate when k is large. Unfortunately, their algorithms run in time which is exponential in either n or k. We first give a sequence of polynomial time algorithms whose estimation error approaches the information-theoretically optimal bound for this problem. Our approach is based on recent algorithms derived from the sum-of-squares hierarchy, in the context of high-dimensional robust estimation. We show that algorithms for learning from untrusted batches can also be cast in this framework, but by working with a more complicated set of test functions. It turns out that this abstraction is quite powerful, and can be generalized to incorporate additional problem specific constraints. Our second and main result is to show that this technology can be leveraged to build in prior knowledge about the shape of the distribution. Crucially, this allows us to reduce the sample complexity of learning from untrusted batches to polylogarithmic in n for most natural classes of distributions, which is important in many applications. To do so, we demonstrate that these sum-of-squares algorithms for robust mean estimation can be made to handle complex combinatorial constraints (e.g. those arising from VC theory), which may be of independent technical interest.
- Published
- 2022
27. Kalman filtering with adversarial corruptions
- Author
Massachusetts Institute of Technology. Department of Mathematics, Chen, Sitan, Koehler, Frederic, Moitra, Ankur, and Yau, Morris
- Published
- 2022
28. Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments
- Author
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Chen, Sitan, Li, Jerry, and Song, Zhao
- Published
- 2022
29. Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination
- Author
Massachusetts Institute of Technology. Department of Mathematics, Chen, Sitan, Koehler, Frederic, Moitra, Ankur, and Yau, Morris
- Published
- 2022
30. Algorithmic foundations for the diffraction limit
- Author
Massachusetts Institute of Technology. Department of Mathematics, Chen, Sitan, and Moitra, Ankur
- Published
- 2022
31. Beyond the low-degree algorithm: mixtures of subcubes and their applications
- Author
Chen, Sitan, and Moitra, Ankur
- Published
- 2021
32. Quantum advantage in learning from experiments
- Author
Huang, Hsin-Yuan, Broughton, Michael, Cotler, Jordan, Chen, Sitan, Li, Jerry, Mohseni, Masoud, Neven, Hartmut, Babbush, Ryan, Kueng, Richard, Preskill, John, and McClean, Jarrod R.
- Abstract
Quantum technology has the potential to revolutionize how we acquire and process experimental data to learn about the physical world. An experimental setup that transduces data from a physical system to a stable quantum memory, and processes that data using a quantum computer, could have significant advantages over conventional experiments in which the physical system is measured and the outcomes are processed using a classical computer. We prove that, in various tasks, quantum machines can learn from exponentially fewer experiments than those required in conventional experiments. The exponential advantage holds in predicting properties of physical systems, performing quantum principal component analysis on noisy states, and learning approximate models of physical dynamics. In some tasks, the quantum processing needed to achieve the exponential advantage can be modest; for example, one can simultaneously learn about many noncommuting observables by processing only two copies of the system. Conducting experiments with up to 40 superconducting qubits and 1300 quantum gates, we demonstrate that a substantial quantum advantage can be realized using today's relatively noisy quantum processors. Our results highlight how quantum technology can enable powerful new strategies to learn about nature., Comment: 6 pages, 17 figures + 46 page appendix; open-source code available at https://github.com/quantumlib/ReCirq/tree/master/recirq/qml_lfe
- Published
- 2021
33. A Hierarchy for Replica Quantum Advantage
- Author
Chen, Sitan, Cotler, Jordan, Huang, Hsin-Yuan, and Li, Jerry
- Abstract
We prove that given the ability to make entangled measurements on at most $k$ replicas of an $n$-qubit state $\rho$ simultaneously, there is a property of $\rho$ which requires at least order $2^n$ measurements to learn. However, the same property only requires one measurement to learn if we can make an entangled measurement over a number of replicas polynomial in $k, n$. Because the above holds for each positive integer $k$, we obtain a hierarchy of tasks necessitating progressively more replicas to be performed efficiently. We introduce a powerful proof technique to establish our results, and also use this to provide new bounds for testing the mixedness of a quantum state., Comment: 3+17 pages, 2 figures; v2: typos fixed
- Published
- 2021
34. Exponential separations between learning with and without quantum memory
- Author
Chen, Sitan, Cotler, Jordan, Huang, Hsin-Yuan, and Li, Jerry
- Abstract
We study the power of quantum memory for learning properties of quantum systems and dynamics, which is of great importance in physics and chemistry. Many state-of-the-art learning algorithms require access to an additional external quantum memory. While such a quantum memory is not required a priori, in many cases, algorithms that do not utilize quantum memory require much more data than those which do. We show that this trade-off is inherent in a wide range of learning problems. Our results include the following: (1) We show that to perform shadow tomography on an $n$-qubit state $\rho$ with $M$ observables, any algorithm without quantum memory requires $\Omega(\min(M, 2^n))$ samples of $\rho$ in the worst case. Up to logarithmic factors, this matches the upper bound of [HKP20] and completely resolves an open question in [Aar18, AR19]. (2) We establish exponential separations between algorithms with and without quantum memory for purity testing, distinguishing scrambling and depolarizing evolutions, as well as uncovering symmetry in physical dynamics. Our separations improve and generalize prior work of [ACQ21] by allowing for a broader class of algorithms without quantum memory. (3) We give the first tradeoff between quantum memory and sample complexity. We prove that to estimate absolute values of all $n$-qubit Pauli observables, algorithms with $k < n$ qubits of quantum memory require at least $\Omega(2^{(n-k)/3})$ samples, but there is an algorithm using $n$-qubit quantum memory which only requires $O(n)$ samples. The separations we show are sufficiently large and could already be evident, for instance, with tens of qubits. This provides a concrete path towards demonstrating real-world advantage for learning algorithms with quantum memory., Comment: 77 pages, 2 figures, many diagrams; accepted to FOCS 2021; v2: typos corrected
- Published
- 2021
35. Kalman Filtering with Adversarial Corruptions
- Author
Chen, Sitan, Koehler, Frederic, Moitra, Ankur, and Yau, Morris
- Abstract
Here we revisit the classic problem of linear quadratic estimation, i.e. estimating the trajectory of a linear dynamical system from noisy measurements. The celebrated Kalman filter gives an optimal estimator when the measurement noise is Gaussian, but is widely known to break down when one deviates from this assumption, e.g. when the noise is heavy-tailed. Many ad hoc heuristics have been employed in practice for dealing with outliers. In a pioneering work, Schick and Mitter gave provable guarantees when the measurement noise is a known infinitesimal perturbation of a Gaussian and raised the important question of whether one can get similar guarantees for large and unknown perturbations. In this work we give a truly robust filter: we give the first strong provable guarantees for linear quadratic estimation when even a constant fraction of measurements have been adversarially corrupted. This framework can model heavy-tailed and even non-stationary noise processes. Our algorithm robustifies the Kalman filter in the sense that it competes with the optimal algorithm that knows the locations of the corruptions. Our work is in a challenging Bayesian setting where the number of measurements scales with the complexity of what we need to estimate. Moreover, in linear dynamical systems past information decays over time. We develop a suite of new techniques to robustly extract information across different time steps and over varying time scales., Comment: 57 pages, comments welcome
- Published
- 2021
36. Efficiently Learning Any One Hidden Layer ReLU Network From Queries
- Author
Chen, Sitan, Klivans, Adam R, and Meka, Raghu
- Abstract
Model extraction attacks have renewed interest in the classic problem of learning neural networks from queries. In this work we give the first polynomial-time algorithm for learning arbitrary one hidden layer neural networks with ReLU activations, provided black-box access to the network. Formally, we show that if $F$ is an arbitrary one hidden layer neural network with ReLU activations, there is an algorithm with query complexity and running time that is polynomial in all parameters that outputs a network $F'$ achieving low square loss relative to $F$ with respect to the Gaussian measure. While a number of works in the security literature have proposed and empirically demonstrated the effectiveness of certain algorithms for this problem, ours is the first with fully polynomial-time guarantees of efficiency even for worst-case networks (in particular our algorithm succeeds in the overparameterized setting)., Comment: To appear in Advances in Neural Information Processing Systems (NeurIPS 2021)
- Published
- 2021
37. Efficiently learning structured distributions from untrusted batches
- Author
Chen, Sitan, Li, Jerry, and Moitra, Ankur
- Abstract
© 2020 ACM. We study the problem, introduced by Qiao and Valiant, of learning from untrusted batches. Here, we assume m users, all of whom have samples from some underlying distribution over 1, ..., n. Each user sends a batch of k i.i.d. samples from this distribution; however, an ε-fraction of users are untrustworthy and can send adversarially chosen responses. The goal of the algorithm is to learn the underlying distribution in total variation distance. When k = 1 this is the standard robust univariate density estimation setting and it is well-understood that Ω(ε) error is unavoidable. Surprisingly, Qiao and Valiant gave an estimator which improves upon this rate when k is large. Unfortunately, their algorithms run in time which is exponential in either n or k. We first give a sequence of polynomial time algorithms whose estimation error approaches the information-theoretically optimal bound for this problem. Our approach is based on recent algorithms derived from the sum-of-squares hierarchy, in the context of high-dimensional robust estimation. We show that algorithms for learning from untrusted batches can also be cast in this framework, but by working with a more complicated set of test functions. It turns out that this abstraction is quite powerful, and can be generalized to incorporate additional problem specific constraints. Our second and main result is to show that this technology can be leveraged to build in prior knowledge about the shape of the distribution. Crucially, this allows us to reduce the sample complexity of learning from untrusted batches to polylogarithmic in n for most natural classes of distributions, which is important in many applications. To do so, we demonstrate that these sum-of-squares algorithms for robust mean estimation can be made to handle complex combinatorial constraints (e.g. those arising from VC theory), which may be of independent technical interest.
- Published
- 2021
38. Improved bounds for randomly sampling colorings via linear programming
- Author
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Department of Mathematics, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Moitra, Ankur, and Chen, Sitan
- Abstract
Copyright © 2019 by SIAM. A well-known conjecture in computer science and statistical physics is that Glauber dynamics on the set of k-colorings of a graph G on n vertices with maximum degree ∆ is rapidly mixing for k ≥ ∆ + 2. In FOCS 1999, Vigoda [43] showed that the flip dynamics (and therefore also Glauber dynamics) is rapidly mixing for any k > (11/6)∆. It turns out that there is a natural barrier at 11/6, below which there is no one-step coupling that is contractive with respect to the Hamming metric, even for the flip dynamics. We use linear programming and duality arguments to fully characterize the obstructions to going beyond 11/6. These extremal configurations turn out to be quite brittle, and in this paper we use this to give two proofs that the Glauber dynamics is rapidly mixing for any k ≥ (11/6 − ε₀)∆ for some absolute constant ε₀ > 0. This is the first improvement to Vigoda’s result that holds for general graphs. Our first approach analyzes a variable-length coupling in which these configurations break apart with high probability before the coupling terminates, and our other approach analyzes a one-step path coupling with a new metric that counts the extremal configurations. Additionally, our results extend to list coloring, a widely studied generalization of coloring, where the previously best known results required k > 2∆.
- Published
- 2021
39. Toward Instance-Optimal State Certification With Incoherent Measurements
- Author
Chen, Sitan, Li, Jerry, and O'Donnell, Ryan
- Abstract
We revisit the basic problem of quantum state certification: given copies of unknown mixed state $\rho\in\mathbb{C}^{d\times d}$ and the description of a mixed state $\sigma$, decide whether $\sigma = \rho$ or $\|\sigma - \rho\|_{\mathsf{tr}} \ge \epsilon$. When $\sigma$ is maximally mixed, this is mixedness testing, and it is known that $\Omega(d^{\Theta(1)}/\epsilon^2)$ copies are necessary, where the exact exponent depends on the type of measurements the learner can make [OW15, BCL20], and in many of these settings there is a matching upper bound [OW15, BOW19, BCL20]. Can one avoid this $d^{\Theta(1)}$ dependence for certain kinds of mixed states $\sigma$, e.g. ones which are approximately low rank? More ambitiously, does there exist a simple functional $f:\mathbb{C}^{d\times d}\to\mathbb{R}_{\ge 0}$ for which one can show that $\Theta(f(\sigma)/\epsilon^2)$ copies are necessary and sufficient for state certification with respect to any $\sigma$? Such instance-optimal bounds are known in the context of classical distribution testing, e.g. [VV17]. Here we give the first bounds of this nature for the quantum setting, showing (up to log factors) that the copy complexity for state certification using nonadaptive incoherent measurements is essentially given by the copy complexity for mixedness testing times the fidelity between $\sigma$ and the maximally mixed state. Surprisingly, our bound differs substantially from instance optimal bounds for the classical problem, demonstrating a qualitative difference between the two settings., Comment: 52 pages, 1 figure, v2: refined exposition
- Published
- 2021
40. Symmetric Sparse Boolean Matrix Factorization and Applications
- Author
Chen, Sitan, Song, Zhao, Tao, Runzhou, and Zhang, Ruizhe
- Abstract
In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials., Comment: 33 pages, to appear in Innovations in Theoretical Computer Science (ITCS 2022), v2: updated refs
- Published
- 2021
41. Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability
- Author
Chen, Sitan, Koehler, Frederic, Moitra, Ankur, and Yau, Morris
- Abstract
In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $\eta$. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $\eta + \epsilon$ for any $\epsilon > 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/\epsilon)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves $\eta + \epsilon$. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. Moreover we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = \sigma(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $\sigma$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties., Comment: 52 pages, v2: updated references
- Published
- 2020
42. Learning Polynomials of Few Relevant Dimensions
- Author
Chen, Sitan, and Meka, Raghu
- Abstract
Polynomial regression is a basic primitive in learning and statistics. In its most basic form the goal is to fit a degree $d$ polynomial to a response variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely well-studied with many applications and has sample and runtime complexity $\Theta(n^d)$. Can one achieve better runtime if the intrinsic dimension of the data is much smaller than the ambient dimension $n$? Concretely, we are given samples $(x,y)$ where $y$ is a degree at most $d$ polynomial in an unknown $r$-dimensional projection (the relevant dimensions) of $x$. This can be seen both as a generalization of phase retrieval and as a special case of learning multi-index models where the link function is an unknown low-degree polynomial. Note that without distributional assumptions, this is at least as hard as junta learning. In this work we consider the important case where the covariates are Gaussian. We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity that is roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$. Prior to our work, no such results were known even for the case of $r=1$. We introduce a new filtered PCA approach to get a warm start for the true subspace and use geodesic SGD to boost to arbitrary accuracy; our techniques may be of independent interest, especially for problems dealing with subspace recovery or analyzing SGD on manifolds., Comment: 64 pages
- Published
- 2020
43. Entanglement is Necessary for Optimal Quantum Property Testing
- Author
Bubeck, Sebastien, Chen, Sitan, and Li, Jerry
- Abstract
There has been a surge of progress in recent years in developing algorithms for testing and learning quantum states that achieve optimal copy complexity. Unfortunately, they require the use of entangled measurements across many copies of the underlying state and thus remain outside the realm of what is currently experimentally feasible. A natural question is whether one can match the copy complexity of such algorithms using only independent---but possibly adaptively chosen---measurements on individual copies. We answer this in the negative for arguably the most basic quantum testing problem: deciding whether a given $d$-dimensional quantum state is equal to or $\epsilon$-far in trace distance from the maximally mixed state. While it is known how to achieve optimal $O(d/\epsilon^2)$ copy complexity using entangled measurements, we show that with independent measurements, $\Omega(d^{4/3}/\epsilon^2)$ is necessary, even if the measurements are chosen adaptively. This resolves a question of Wright. To obtain this lower bound, we develop several new techniques, including a chain-rule style proof of Paninski's lower bound for classical uniformity testing, which may be of independent interest., Comment: 31 pages, comments welcome
- Published
- 2020
44. Algorithmic Foundations for the Diffraction Limit
- Author
Chen, Sitan, and Moitra, Ankur
- Abstract
For more than a century and a half it has been widely believed (but was never rigorously shown) that the physics of diffraction imposes certain fundamental limits on the resolution of an optical system. However, our understanding of what exactly can and cannot be resolved has never risen above heuristic arguments which, even worse, appear contradictory. In this work we remedy this gap by studying the diffraction limit as a statistical inverse problem and, based on connections to provable algorithms for learning mixture models, we rigorously prove upper and lower bounds on the statistical and algorithmic complexity needed to resolve closely spaced point sources. In particular we show that there is a phase transition where the sample complexity goes from polynomial to exponential. Surprisingly, we show that this does not occur at the Abbe limit, which has long been presumed to be the true diffraction limit., Comment: 55 pages, 5 figures, v2: improved lower bound going beyond the Abbe limit
- Published
- 2020
45. Learning Structured Distributions From Untrusted Batches: Faster and Simpler
- Author
Chen, Sitan, Li, Jerry, and Moitra, Ankur
- Abstract
We revisit the problem of learning from untrusted batches introduced by Qiao and Valiant [QV17]. Recently, Jain and Orlitsky [JO19] gave a simple semidefinite programming approach based on the cut-norm that achieves essentially information-theoretically optimal error in polynomial time. Concurrently, Chen et al. [CLM19] considered a variant of the problem where $\mu$ is assumed to be structured, e.g. log-concave, monotone hazard rate, $t$-modal, etc. In this case, it is possible to achieve the same error with sample complexity sublinear in $n$, and they exhibited a quasi-polynomial time algorithm for doing so using Haar wavelets. In this paper, we find an appealing way to synthesize the techniques of [JO19] and [CLM19] to give the best of both worlds: an algorithm which runs in polynomial time and can exploit structure in the underlying distribution to achieve sublinear sample complexity. Along the way, we simplify the approach of [JO19] by avoiding the need for SDP rounding and giving a more direct interpretation of it through the lens of soft filtering, a powerful recent technique in high-dimensional robust estimation. We validate the usefulness of our algorithms in preliminary experimental evaluations., Comment: 37 pages, version 2 includes experiments
- Published
- 2020
46. On InstaHide, Phase Retrieval, and Sparse Matrix Factorization
- Author
Chen, Sitan, Li, Xiaoxiao, Song, Zhao, and Zhuo, Danyang
- Abstract
In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible hardness assumption and assuming the distributions generating the public and private data satisfy certain properties. We show that the answer to this appears to be quite subtle and closely related to the average-case complexity of a new multi-task, missing-data version of the classic problem of phase retrieval. Motivated by this connection, we design a provable algorithm that can recover private vectors using only the public vectors and synthetic vectors generated by InstaHide, under the assumption that the private and public vectors are isotropic Gaussian., Comment: 30 pages, to appear in ICLR 2021, v2: updated discussion of follow-up work
- Published
- 2020
47. Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination
- Author
Chen, Sitan, Koehler, Frederic, Moitra, Ankur, and Yau, Morris
- Abstract
In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees even without distributional assumptions altogether, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact our algorithms succeed where conventional methods fail. In particular we show strong lower bounds against Huber regression and more generally any convex M-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least-squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level $\eta$, reach the optimal breakdown point, and naturally apply to infinite dimensional settings where the feature vectors are represented implicitly via a kernel map., Comment: 66 pages, 1 figure, v3: refined exposition and improved rates
- Published
- 2020
48. Learning Deep ReLU Networks Is Fixed-Parameter Tractable
- Author
Chen, Sitan, Klivans, Adam R., and Meka, Raghu
- Abstract
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters. Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network (we show that some dependence on the Lipschitz constant is necessary). We also give a bound that is doubly exponential in the size of the network but is independent of spectral norm. These results provably cannot be obtained using gradient-based methods and give the first example of a class of efficiently learnable neural networks that gradient descent will fail to learn. In contrast, prior work for learning networks of depth three or higher requires exponential time in the ambient dimension, even when the above parameters are bounded by a constant. Additionally, all prior work for the depth-two case requires well-conditioned weights and/or positive coefficients to obtain efficient run-times. Our algorithm does not require these assumptions. Our main technical tool is a type of filtered PCA that can be used to iteratively recover an approximate basis for the subspace spanned by the hidden units in the first layer. Our analysis leverages new structural results on lattice polynomials from tropical geometry., Comment: 39 pages
- Published
- 2020
49. Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments
- Author
Chen, Sitan, Li, Jerry, and Song, Zhao
- Abstract
We consider the problem of learning a mixture of linear regressions (MLRs). An MLR is specified by $k$ nonnegative mixing weights $p_1, \ldots, p_k$ summing to $1$, and $k$ unknown regressors $w_1,...,w_k\in\mathbb{R}^d$. A sample from the MLR is drawn by sampling $i$ with probability $p_i$, then outputting $(x, y)$ where $y = \langle x, w_i \rangle + \eta$, where $\eta\sim\mathcal{N}(0,\varsigma^2)$ for noise rate $\varsigma$. Mixtures of linear regressions are a popular generative model and have been studied extensively in machine learning and theoretical computer science. However, all previous algorithms for learning the parameters of an MLR require running time and sample complexity scaling exponentially with $k$. In this paper, we give the first algorithm for learning an MLR that runs in time which is sub-exponential in $k$. Specifically, we give an algorithm which runs in time $\widetilde{O}(d)\cdot\exp(\widetilde{O}(\sqrt{k}))$ and outputs the parameters of the MLR to high accuracy, even in the presence of nontrivial regression noise. We demonstrate a new method that we call "Fourier moment descent" which uses univariate density estimation and low-degree moments of the Fourier transform of suitable univariate projections of the MLR to iteratively refine our estimate of the parameters. To the best of our knowledge, these techniques have never been used in the context of high dimensional distribution learning, and may be of independent interest. We also show that our techniques can be used to give a sub-exponential time algorithm for learning mixtures of hyperplanes, a natural hard instance of the subspace clustering problem., Comment: 83 pages, 1 figure
- Published
- 2019
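The sampling process in the abstract of entry 49 can be illustrated with a short sketch. It is not the paper's learning algorithm; it only draws one sample from a mixture of linear regressions as defined there. The function name `sample_mlr` is hypothetical, and the choice of standard Gaussian covariates is an assumption, since the abstract does not state how $x$ is distributed.

```python
import random


def sample_mlr(weights, regressors, noise_std, rng):
    """Draw one (x, y, i) triple from a mixture of linear regressions.

    weights     -- mixing weights p_1..p_k (nonnegative, summing to 1)
    regressors  -- list of k regressors w_i, each a list of d floats
    noise_std   -- standard deviation of the additive Gaussian noise
    rng         -- a random.Random instance

    Assumption: x is standard Gaussian in each coordinate; the abstract
    does not specify the covariate distribution.
    """
    d = len(regressors[0])
    # Choose component i with probability p_i.
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Draw the covariate vector (assumed Gaussian, see docstring).
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # y = <x, w_i> + eta, with eta ~ N(0, noise_std^2).
    eta = rng.gauss(0.0, noise_std)
    y = sum(xj * wj for xj, wj in zip(x, regressors[i])) + eta
    return x, y, i


if __name__ == "__main__":
    rng = random.Random(0)
    x, y, i = sample_mlr([0.3, 0.7], [[1.0, 2.0], [-1.0, 0.5]], 0.1, rng)
    print(i, x, y)
```

A learner would see only the pairs $(x, y)$; the component index is returned here purely so the sample can be checked.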
50. Efficiently Learning Structured Distributions from Untrusted Batches
- Author
Chen, Sitan, Li, Jerry, and Moitra, Ankur
- Abstract
We study the problem, introduced by Qiao and Valiant, of learning from untrusted batches. Here, we assume $m$ users, all of whom have samples from some underlying distribution $p$ over $1, \ldots, n$. Each user sends a batch of $k$ i.i.d. samples from this distribution; however an $\epsilon$-fraction of users are untrustworthy and can send adversarially chosen responses. The goal is then to learn $p$ in total variation distance. When $k = 1$ this is the standard robust univariate density estimation setting and it is well-understood that $\Omega (\epsilon)$ error is unavoidable. Surprisingly, Qiao and Valiant gave an estimator which improves upon this rate when $k$ is large. Unfortunately, their algorithms run in time exponential in either $n$ or $k$. We first give a sequence of polynomial time algorithms whose estimation error approaches the information-theoretically optimal bound for this problem. Our approach is based on recent algorithms derived from the sum-of-squares hierarchy, in the context of high-dimensional robust estimation. We show that algorithms for learning from untrusted batches can also be cast in this framework, but by working with a more complicated set of test functions. It turns out this abstraction is quite powerful and can be generalized to incorporate additional problem-specific constraints. Our second and main result is to show that this technology can be leveraged to build in prior knowledge about the shape of the distribution. Crucially, this allows us to reduce the sample complexity of learning from untrusted batches to polylogarithmic in $n$ for most natural classes of distributions, which is important in many applications. To do so, we demonstrate that these sum-of-squares algorithms for robust mean estimation can be made to handle complex combinatorial constraints (e.g. those arising from VC theory), which may be of independent technical interest., Comment: 46 pages
- Published
- 2019
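The data model in the abstract of entry 50 can be sketched as follows. This only generates untrusted batches and measures total variation distance; it does not implement the sum-of-squares estimators described there. The names `make_batches` and `tv_distance` are hypothetical, and the adversary used here (bad users always report the symbol $n$) is one arbitrary illustrative choice, not the worst case.

```python
import random


def make_batches(p, m, k, eps, rng):
    """Generate m batches of k samples over the symbols 1..n.

    p   -- list of n probabilities, p[j-1] is the mass on symbol j
    eps -- fraction of untrustworthy users

    Assumption: the first int(eps * m) users are adversarial and always
    send the symbol n. A real adversary may choose responses adaptively.
    """
    n = len(p)
    num_bad = int(eps * m)
    symbols = list(range(1, n + 1))
    batches = []
    for user in range(m):
        if user < num_bad:
            batches.append([n] * k)  # adversarially chosen batch
        else:
            batches.append(rng.choices(symbols, weights=p, k=k))  # i.i.d. from p
    return batches


def tv_distance(p, q):
    """Total variation distance between two distributions on 1..n."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.25, 0.25, 0.25, 0.25]
    batches = make_batches(p, m=100, k=20, eps=0.1, rng=rng)
    # Naive estimate: pool every sample, trusting all users equally.
    counts = [0] * len(p)
    for batch in batches:
        for s in batch:
            counts[s - 1] += 1
    total = sum(counts)
    print(tv_distance(p, [c / total for c in counts]))
```

The pooled estimate at the end treats every user as honest, so its error does not shrink as $k$ grows; that is the baseline the batch-aware estimators in the abstract improve on.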