Author: "Cherapanamjeri, Yeshwanth" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Cherapanamjeri, Yeshwanth"' showing total 36 results

Start Over Author "Cherapanamjeri, Yeshwanth"

36 results on '"Cherapanamjeri, Yeshwanth"'

1. How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

Author: Daras, Giannis, Cherapanamjeri, Yeshwanth, and Daskalakis, Constantinos
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30,000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g.~$10\%$ of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments., Comment: Work in progress
Published: 2024

2. Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Author: Hsu, Aliyah R., Zhou, Georgia, Cherapanamjeri, Yeshwanth, Huang, Yaxuan, Odisho, Anobel Y., Carroll, Peter R., and Yu, Bin
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or its approximations to identify subgraphs in models for specific tasks (circuits). They often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models. CD-T can produce circuits of arbitrary level of abstraction, and is the first able to produce circuits as fine-grained as attention heads at specific sequence positions efficiently. CD-T consists of a set of mathematical equations to isolate contribution of model features. Through recursively computing contribution of all nodes in a computational graph of a model using CD-T followed by pruning, we are able to reduce circuit discovery runtime from hours to seconds compared to state-of-the-art baselines. On three standard circuit evaluation datasets (indirect object identification, greater-than comparisons, and docstring completion), we demonstrate that CD-T outperforms ACDC and EAP by better recovering the manual circuits with an average of 97% ROC AUC under low runtimes. In addition, we provide evidence that faithfulness of CD-T circuits is not due to random chance by showing our circuits are 80% more faithful than random circuits of up to 60% of the original model size. Finally, we show CD-T circuits are able to perfectly replicate original models' behavior (faithfulness $ = 1$) using fewer nodes than the baselines for all tasks. Our results underscore the great promise of CD-T for efficient automated mechanistic interpretability, paving the way for new insights into the workings of large language models.
Published: 2024

3. Statistical Barriers to Affine-equivariant Estimation

Author: Chen, Zihao and Cherapanamjeri, Yeshwanth
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: We investigate the quantitative performance of affine-equivariant estimators for robust mean estimation. As a natural stability requirement, the construction of such affine-equivariant estimators has been extensively studied in the statistics literature. We quantitatively evaluate these estimators under two outlier models which have been the subject of much recent work: the heavy-tailed and adversarial corruption settings. We establish lower bounds which show that affine-equivariance induces a strict degradation in recovery error with quantitative rates degrading by a factor of $\sqrt{d}$ in both settings. We find that classical estimators such as the Tukey median (Tukey '75) and Stahel-Donoho estimator (Stahel '81 and Donoho '82) are either quantitatively sub-optimal even within the class of affine-equivariant estimators or lack any quantitative guarantees. On the other hand, recent estimators with strong quantitative guarantees are not affine-equivariant or require additional distributional assumptions to achieve it. We remedy this by constructing a new affine-equivariant estimator which nearly matches our lower bound. Our estimator is based on a novel notion of a high-dimensional median which may be of independent interest. Notably, our results are applicable more broadly to any estimator whose performance is evaluated in the Mahalanobis norm which, for affine-equivariant estimators, corresponds to an evaluation in Euclidean norm on isotropic distributions.
Published: 2023

4. Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making

Author: Hsu, Aliyah R., Cherapanamjeri, Yeshwanth, Park, Briton, Naumann, Tristan, Odisho, Anobel Y., and Yu, Bin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains.
Published: 2023

5. Optimal PAC Bounds Without Uniform Convergence

Author: Aden-Ali, Ishaq, Cherapanamjeri, Yeshwanth, Shetty, Abhishek, and Zhivotovskiy, Nikita
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Mathematics - Statistics Theory
Abstract: In statistical learning theory, determining the sample complexity of realizable binary classification for VC classes was a long-standing open problem. The results of Simon and Hanneke established sharp upper bounds in this setting. However, the reliance of their argument on the uniform convergence principle limits its applicability to more general learning settings such as multiclass classification. In this paper, we address this issue by providing optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments. Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds. As an application, by adapting the one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth, we propose an algorithm that achieves an optimal PAC bound for binary classification. Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal, addressing a variant of a classic question posed by Warmuth. We further instantiate our framework in three settings where uniform convergence is provably suboptimal. For multiclass classification, we prove an optimal risk bound that scales with the one-inclusion hypergraph density of the class, addressing the suboptimality of the analysis of Daniely and Shalev-Shwartz. For partial hypothesis classification, we determine the optimal sample complexity bound, resolving a question posed by Alon, Hanneke, Holzman, and Moran. For realizable bounded regression with absolute loss, we derive an optimal risk bound that relies on a modified version of the scale-sensitive dimension, refining the results of Bartlett and Long. Our rates surpass standard uniform convergence-based results due to the smaller complexity measure in our risk bound., Comment: 27 pages
Published: 2023

6. Robust Algorithms on Adaptive Inputs from Bounded Adversaries

Author: Cherapanamjeri, Yeshwanth, Silwal, Sandeep, Woodruff, David P., Zhang, Fred, Zhang, Qiuyi, and Zhou, Samson
Subjects: Computer Science - Data Structures and Algorithms
Abstract: We study dynamic algorithms robust to adaptive input generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the input are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first recall a unified framework of [HKM+20, BKM+22, ACSS23] for answering $Q$ adaptive queries that incurs $\widetilde{\mathcal{O}}(\sqrt{Q})$ overhead in space, which is roughly a quadratic improvement over the na\"{i}ve implementation, and only incurs a logarithmic overhead in query time. Although the general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries and serves as a preliminary benchmark, we demonstrate even better algorithmic improvements for (1) reducing the pre-processing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation. Finally, we complement our theoretical results with additional empirical evaluations.
Published: 2023

7. The One-Inclusion Graph Algorithm is not Always Optimal

Author: Aden-Ali, Ishaq, Cherapanamjeri, Yeshwanth, Shetty, Abhishek, and Zhivotovskiy, Nikita
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Mathematics - Statistics Theory
Abstract: The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov's inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds., Comment: 16 pages
Published: 2022

8. Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

Author: Cherapanamjeri, Yeshwanth, Silwal, Sandeep, Woodruff, David P., and Zhou, Samson
Subjects: Computer Science - Data Structures and Algorithms
Abstract: We study fundamental problems in linear algebra, such as finding a maximal linearly independent subset of rows or columns (a basis), solving linear regression, or computing a subspace embedding. For these problems, we consider input matrices $\mathbf{A}\in\mathbb{R}^{n\times d}$ with $n > d$. The input can be read in $\text{nnz}(\mathbf{A})$ time, which denotes the number of nonzero entries of $\mathbf{A}$. In this paper, we show that beyond the time required to read the input matrix, these fundamental linear algebra problems can be solved in $d^{\omega}$ time, i.e., where $\omega \approx 2.37$ is the current matrix-multiplication exponent. To do so, we introduce a constant-factor subspace embedding with the optimal $m=\mathcal{O}(d)$ number of rows, and which can be applied in time $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}\right) + d^{2 + \alpha}\text{poly}(\log d)$ for any trade-off parameter $\alpha>0$, tightening a recent result by Chepurko et. al. [SODA 2022] that achieves an $\exp(\text{poly}(\log\log n))$ distortion with $m=d\cdot\text{poly}(\log\log d)$ rows in $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}+d^{2+\alpha+o(1)}\right)$ time. Our subspace embedding uses a recently shown property of stacked Subsampled Randomized Hadamard Transforms (SRHT), which actually increase the input dimension, to "spread" the mass of an input vector among a large number of coordinates, followed by random sampling. To control the effects of random sampling, we use fast semidefinite programming to reweight the rows. We then use our constant-factor subspace embedding to give the first optimal runtime algorithms for finding a maximal linearly independent subset of columns, regression, and leverage score sampling. To do so, we also introduce a novel subroutine that iteratively grows a set of independent rows, which may be of independent interest., Comment: SODA 2023
Published: 2022

9. What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

Author: Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, and Zampetakis, Manolis
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
Published: 2022

10. Estimation of Standard Auction Models

Author: Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, and Zampetakis, Manolis
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Data Structures and Algorithms, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We provide efficient estimation methods for first- and second-price auctions under independent (asymmetric) private values and partial observability. Given a finite set of observations, each comprising the identity of the winner and the price they paid in a sequence of identical auctions, we provide algorithms for non-parametrically estimating the bid distribution of each bidder, as well as their value distributions under equilibrium assumptions. We provide finite-sample estimation bounds which are uniform in that their error rates do not depend on the bid/value distributions being estimated. Our estimation guarantees advance a body of work in Econometrics wherein only identification results have been obtained, unless the setting is symmetric, parametric, or all bids are observable. Our guarantees also provide computationally and statistically effective alternatives to classical techniques from reliability theory. Finally, our results are immediately applicable to Dutch and English auctions.
Published: 2022

11. Uniform Approximations for Randomized Hadamard Transforms with Applications

Author: Cherapanamjeri, Yeshwanth and Nelson, Jelani
Subjects: Computer Science - Machine Learning, Computer Science - Computational Geometry, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: Randomized Hadamard Transforms (RHTs) have emerged as a computationally efficient alternative to the use of dense unstructured random matrices across a range of domains in computer science and machine learning. For several applications such as dimensionality reduction and compressed sensing, the theoretical guarantees for methods based on RHTs are comparable to approaches using dense random matrices with i.i.d.\ entries. However, several such applications are in the low-dimensional regime where the number of rows sampled from the matrix is rather small. Prior arguments are not applicable to the high-dimensional regime often found in machine learning applications like kernel approximation. Given an ensemble of RHTs with Gaussian diagonals, $\{M^i\}_{i = 1}^m$, and any $1$-Lipschitz function, $f: \mathbb{R} \to \mathbb{R}$, we prove that the average of $f$ over the entries of $\{M^i v\}_{i = 1}^m$ converges to its expectation uniformly over $\| v \| \leq 1$ at a rate comparable to that obtained from using truly Gaussian matrices. We use our inequality to then derive improved guarantees for two applications in the high-dimensional regime: 1) kernel approximation and 2) distance estimation. For kernel approximation, we prove the first \emph{uniform} approximation guarantees for random features constructed through RHTs lending theoretical justification to their empirical success while for distance estimation, our convergence result implies data structures with improved runtime guarantees over previous work by the authors. We believe our general inequality is likely to find use in other applications., Comment: STOC 2022
Published: 2022

12. Terminal Embeddings in Sublinear Time

Author: Cherapanamjeri, Yeshwanth and Nelson, Jelani
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Computational Geometry, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating their embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.
Published: 2021
Full Text: View/download PDF

13. Adversarial Examples in Multi-Layer Random ReLU Networks

Author: Bartlett, Peter L., Bubeck, Sébastien, and Cherapanamjeri, Yeshwanth
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
Published: 2021

14. A single gradient step finds adversarial examples on random two-layers neural networks

Author: Bubeck, Sébastien, Cherapanamjeri, Yeshwanth, Gidel, Gauthier, and Combes, Rémi Tachet des
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning
Abstract: Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function., Comment: Added a comment about universal adversarial perturbations. 18 pages, 7 figures
Published: 2021

15. Optimal Mean Estimation without a Variance

Author: Cherapanamjeri, Yeshwanth, Tripuraneni, Nilesh, Bartlett, Peter L., and Jordan, Michael I.
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample $\mathbf{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $\mu$ which satisfies the following \emph{weak-moment} assumption for some ${\alpha \in [0, 1]}$: \begin{equation*} \forall \|v\| = 1: \mathbb{E}_{X \thicksim \mathcal{D}}[\lvert \langle X - \mu, v\rangle \rvert^{1 + \alpha}] \leq 1, \end{equation*} and given a target failure probability, $\delta$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n,d,\delta$. For the specific case of $\alpha = 1$, foundational work of Lugosi and Mendelson exhibits an estimator achieving subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general $\alpha$, and establish the following information-theoretic lower bound on the optimal attainable confidence interval: \begin{equation*} \Omega \left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{\frac{\alpha}{(1 + \alpha)}} + \left(\frac{\log 1 / \delta}{n}\right)^{\frac{\alpha}{(1 + \alpha)}}\right). \end{equation*} Moreover, we devise a computationally-efficient estimator which achieves this lower bound., Comment: Fixed typographical errors in Theorem 1.2, Lemmas 4.3 and C.8
Published: 2020

16. On Adaptive Distance Estimation

Author: Cherapanamjeri, Yeshwanth and Nelson, Jelani
Subjects: Computer Science - Data Structures and Algorithms
Abstract: We provide a static data structure for distance estimation which supports {\it adaptive} queries. Concretely, given a dataset $X = \{x_i\}_{i = 1}^n$ of $n$ points in $\mathbb{R}^d$ and $0 < p \leq 2$, we construct a randomized data structure with low memory consumption and query time which, when later given any query point $q \in \mathbb{R}^d$, outputs a $(1+\epsilon)$-approximation of $\lVert q - x_i \rVert_p$ with high probability for all $i\in[n]$. The main novelty is our data structure's correctness guarantee holds even when the sequence of queries can be chosen adaptively: an adversary is allowed to choose the $j$th query point $q_j$ in a way that depends on the answers reported by the data structure for $q_1,\ldots,q_{j-1}$. Previous randomized Monte Carlo methods do not provide error guarantees in the setting of adaptively chosen queries. Our memory consumption is $\tilde O((n+d)d/\epsilon^2)$, slightly more than the $O(nd)$ required to store $X$ in memory explicitly, but with the benefit that our time to answer queries is only $\tilde O(\epsilon^{-2}(n + d))$, much faster than the naive $\Theta(nd)$ time obtained from a linear scan in the case of $n$ and $d$ very large. Here $\tilde O$ hides $\log(nd/\epsilon)$ factors. We discuss applications to nearest neighbor search and nonparametric estimation. Our method is simple and likely to be applicable to other domains: we describe a generic approach for transforming randomized Monte Carlo data structures which do not support adaptive queries to ones that do, and show that for the problem at hand, it can be applied to standard nonadaptive solutions to $\ell_p$ norm estimation with negligible overhead in query time and a factor $d$ overhead in memory., Comment: Minor correction in proof of Lemma B.6
Published: 2020

17. Optimal Robust Linear Regression in Nearly Linear Time

Author: Cherapanamjeri, Yeshwanth, Aras, Efe, Tripuraneni, Nilesh, Jordan, Michael I., Flammarion, Nicolas, and Bartlett, Peter L.
Subjects: Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X,w^* \rangle + \epsilon$ (with $X \in \mathbb{R}^d$ and $\epsilon$ independent), in which an $\eta$ fraction of the samples have been adversarially corrupted. We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E} [XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian. In both settings, our estimators: (a) Achieve optimal sample complexities and recovery guarantees up to log factors and (b) Run in near linear time ($\tilde{O}(nd / \eta^6)$). Prior to our work, polynomial time algorithms achieving near optimal sample complexities were only known in the setting where $X$ is Gaussian with identity covariance and $\epsilon$ is Gaussian, and no linear time estimators were known for robust linear regression in any setting. Our estimators and their analysis leverage recent developments in the construction of faster algorithms for robust mean estimation to improve runtimes, and refined concentration of measure arguments alongside Gaussian rounding techniques to improve statistical sample complexities.
Published: 2020

18. List Decodable Mean Estimation in Nearly Linear Time

Author: Cherapanamjeri, Yeshwanth, Mohanty, Sidhanth, and Yau, Morris
Subjects: Computer Science - Data Structures and Algorithms
Abstract: Learning from data in the presence of outliers is a fundamental problem in statistics. Until recently, no computationally efficient algorithms were known to compute the mean of a high dimensional distribution under natural assumptions in the presence of even a small fraction of outliers. In this paper, we consider robust statistics in the presence of overwhelming outliers where the majority of the dataset is introduced adversarially. With only an $\alpha < 1/2$ fraction of "inliers" (clean data) the mean of a distribution is unidentifiable. However, in their influential work, [CSV17] introduces a polynomial time algorithm recovering the mean of distributions with bounded covariance by outputting a succinct list of $O(1/\alpha)$ candidate solutions, one of which is guaranteed to be close to the true distributional mean; a direct analog of 'List Decoding' in the theory of error correcting codes. In this work, we develop an algorithm for list decodable mean estimation in the same setting achieving up to constants the information theoretically optimal recovery, optimal sample complexity, and in nearly linear time up to polylogarithmic factors in dimension. Our conceptual innovation is to design a descent style algorithm on a nonconvex landscape, iteratively removing minima to generate a succinct list of solutions. Our runtime bottleneck is a saddle-point optimization for which we design custom primal dual solvers for generalized packing and covering SDP's under Ky-Fan norms, which may be of independent interest., Comment: Minor corrections in Appendix A.3 and Algorithm 2
Published: 2020

19. Algorithms for Heavy-Tailed Statistics: Regression, Covariance Estimation, and Beyond

Author: Cherapanamjeri, Yeshwanth, Hopkins, Samuel B., Kathuria, Tarun, Raghavendra, Prasad, and Tripuraneni, Nilesh
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms
Abstract: We study efficient algorithms for linear regression and covariance estimation in the absence of Gaussian assumptions on the underlying distributions of samples, making assumptions instead about only finitely-many moments. We focus on how many samples are needed to do estimation and regression with high accuracy and exponentially-good success probability. For covariance estimation, linear regression, and several other problems, estimators have recently been constructed with sample complexities and rates of error matching what is possible when the underlying distribution is Gaussian, but algorithms for these estimators require exponential time. We narrow the gap between the Gaussian and heavy-tailed settings for polynomial-time estimators with: 1. A polynomial-time estimator which takes $n$ samples from a random vector $X \in R^d$ with covariance $\Sigma$ and produces $\hat{\Sigma}$ such that in spectral norm $\|\hat{\Sigma} - \Sigma \|_2 \leq \tilde{O}(d^{3/4}/\sqrt{n})$ w.p. $1-2^{-d}$. The information-theoretically optimal error bound is $\tilde{O}(\sqrt{d/n})$; previous approaches to polynomial-time algorithms were stuck at $\tilde{O}(d/\sqrt{n})$. 2. A polynomial-time algorithm which takes $n$ samples $(X_i,Y_i)$ where $Y_i = \langle u,X_i \rangle + \varepsilon_i$ and produces $\hat{u}$ such that the loss $\|u - \hat{u}\|^2 \leq O(d/n)$ w.p. $1-2^{-d}$ for any $n \geq d^{3/2} \log(d)^{O(1)}$. This (information-theoretically optimal) error is achieved by inefficient algorithms for any $n \gg d$; previous polynomial-time algorithms suffer loss $\Omega(d^2/n)$ and require $n \gg d^2$. Our algorithms use degree-$8$ sum-of-squares semidefinite programs. We offer preliminary evidence that improving these rates of error in polynomial time is not possible in the median of means framework our algorithms employ.
Published: 2019

20. Testing Markov Chains without Hitting

Author: Cherapanamjeri, Yeshwanth and Bartlett, Peter L.
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We study the problem of identity testing of markov chains. In this setting, we are given access to a single trajectory from a markov chain with unknown transition matrix $Q$ and the goal is to determine whether $Q = P$ for some known matrix $P$ or $\text{Dist}(P, Q) \geq \epsilon$ where $\text{Dist}$ is suitably defined. In recent work by Daskalakis, Dikkala and Gravin, 2018, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of $P$ which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time thus enabling efficient testing of markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of markov chains.
Published: 2019

21. Fast Mean Estimation with Sub-Gaussian Rates

Author: Cherapanamjeri, Yeshwanth, Flammarion, Nicolas, and Bartlett, Peter L.
Subjects: Mathematics - Statistics Theory, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^4+n^2d)$ for $n$ i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
Published: 2019

22. Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

Author: Cherapanamjeri, Yeshwanth, primary, Silwal, Sandeep, additional, Woodruff, David P., additional, and Zhou, Samson, additional
Published: 2023
Full Text: View/download PDF

23. Thresholding based Efficient Outlier Robust PCA

Author: Cherapanamjeri, Yeshwanth, Jain, Prateek, and Netrapalli, Praneeth
Subjects: Computer Science - Learning
Abstract: We consider the problem of outlier robust PCA (OR-PCA) where the goal is to recover principal directions despite the presence of outlier data points. That is, given a data matrix $M^*$, where $(1-\alpha)$ fraction of the points are noisy samples from a low-dimensional subspace while $\alpha$ fraction of the points can be arbitrary outliers, the goal is to recover the subspace accurately. Existing results for \OR-PCA have serious drawbacks: while some results are quite weak in the presence of noise, other results have runtime quadratic in dimension, rendering them impractical for large scale applications. In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. Moreover, the fraction of outliers, $\alpha$, that our method can handle is tight up to constants while providing nearly optimal computational complexity for a general noise setting. For the special case where the inliers are obtained from a low-dimensional subspace with additive Gaussian noise, we show that a modification of our thresholding based method leads to significant improvement in recovery error (of the subspace) even in the presence of a large fraction of outliers.
Published: 2017

24. Recent Developments in Robust Statistics

Author: Cherapanamjeri, Yeshwanth
Subjects: Computer science, Statistics
Abstract: The design of statistical estimators robust to outliers has been a mainstay of statistical research through the past six decades. These techniques are even more prescient in the contemporary landscape where large-scale machine learning systems are deployed in increasingly noisy and adaptive environments. In this thesis, we consider the task of building such an estimator for arguably the simplest possible statistical estimation problem -- that of mean estimation. There is surprisingly little understanding of the computational and statistical limits of estimation and the trade-offs incurred even for this relatively simple setting. We make progress on this problem along three complementary axes.Our first contribution is a simple algorithmic framework for constructing robust estimators. Our framework allows for a significant speed-up over prior approaches for mean estimation while also allowing for easy extensibility to other statistical estimation tasks where it achieves state-of-the-art performance. Secondly, we investigate the statistical boundaries of mean estimation where we demonstrate the necessary statistical degradation incurred in extremely heavy-tailed scenarios. While prior work showed that estimation could be performed as well as if one had access to Guassian data, we establish that this is no longer true when the data possesses heavier tails. We provide lower bounds which exhibit this degradation and an (efficient) algorithm matching them. Lastly, we consider the stability of these estimators to natural transformations of the data. Inspired by the empirical mean, classical work constructed estimators equivariant to affine transformations. These works, however, lacked the strong quantitative performance of more recent approaches. We demonstrate that such trade-offs are in fact necessary by constructing novel lower bounds for affine-equivariant estimators. We then show that classical estimators are quantitatively deficient even in this restricted class and devise an estimator based on a novel notion of a high-dimensional median which matches the lower bound.
Published: 2023

25. Terminal Embeddings in Sublinear Time

Author: Cherapanamjeri, Yeshwanth, primary and Nelson, Jelani, additional
Published: 2024
Full Text: View/download PDF

26. Nearly-optimal Robust Matrix Completion

Author: Cherapanamjeri, Yeshwanth, Gupta, Kartik, and Jain, Prateek
Subjects: Computer Science - Learning, Computer Science - Numerical Analysis
Abstract: In this paper, we consider the problem of Robust Matrix Completion (RMC) where the goal is to recover a low-rank matrix by observing a small number of its entries out of which a few can be arbitrarily corrupted. We propose a simple projected gradient descent method to estimate the low-rank matrix that alternately performs a projected gradient descent step and cleans up a few of the corrupted entries using hard-thresholding. Our algorithm solves RMC using nearly optimal number of observations as well as nearly optimal number of corruptions. Our result also implies significant improvement over the existing time complexity bounds for the low-rank matrix completion problem. Finally, an application of our result to the robust PCA problem (low-rank+sparse matrix separation) leads to nearly linear time (in matrix dimensions) algorithm for the same; existing state-of-the-art methods require quadratic time. Our empirical results corroborate our theoretical results and show that even for moderate sized problems, our method for robust PCA is an an order of magnitude faster than the existing methods.
Published: 2016

27. Optimal PAC Bounds without Uniform Convergence

Author: Aden-Ali, Ishaq, primary, Cherapanamjeri, Yeshwanth, additional, Shetty, Abhishek, additional, and Zhivotovskiy, Nikita, additional
Published: 2023
Full Text: View/download PDF

28. What Makes a Good Fisherman? Linear Regression under Self-Selection Bias

Author: Cherapanamjeri, Yeshwanth, primary, Daskalakis, Constantinos, additional, Ilyas, Andrew, additional, and Zampetakis, Manolis, additional
Published: 2023
Full Text: View/download PDF

29. What Makes a Good Fisherman? Linear Regression under Self-Selection Bias

Author: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, Zampetakis, Manolis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, and Zampetakis, Manolis
Published: 2023

30. An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

Author: Hsu, Aliyah R., Cherapanamjeri, Yeshwanth, Park, Briton, Naumann, Tristan, Odisho, Anobel Y., and Yu, Bin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Pre-trained transformer models have demonstrated success across many natural language processing (NLP) tasks. In applying these models to the clinical domain, a prevailing assumption is that pre-training language models from scratch on large-scale biomedical data results in substantial improvements. We test this assumption with 4 pathology classification tasks on a corpus of 2907 prostate cancer pathology reports. We evaluate 5 transformer pre-trained models that are the same size but differ in pre-training corpora. Specifically, we analyze 3 categories of models: 1)General-domain: BERT and Turing Natural Language Representation (TNLR) models, which use general corpora for pre-training, 2)Mixed-domain: BioBERT which is obtained from BERT by including PubMed abstracts in pre-training and Clinical BioBERT which additionally includes MIMIC-III clinical notes and 3)Domain-specific: PubMedBERT which is pre-trained from scratch on PubMed abstracts. We find the mixed-domain and domain-specific models exhibit faster feature disambiguation during fine-tuning. However, the domain-specific model, PubMedBERT, can overfit to minority classes when presented with class imbalance, a common scenario in pathology report data. At the same time, the mixed-domain models are more resistant to overfitting. Our findings indicate that the use of general natural language and domain-specific corpora in pre-training serve complementary purposes for pathology report classification. The first enables resistance to overfitting when fine-tuning on an imbalanced dataset while the second allows for more accurate modelling of the fine-tuning domain. An expert evaluation is also conducted to reveal common outlier modes of each model. Our results could inform better fine-tuning practices in the clinical domain, to possibly leverage the benefits of mixed-domain models for imbalanced downstream datasets.
Published: 2023
Full Text: View/download PDF

31. Estimation of Standard Auction Models

Author: Cherapanamjeri, Yeshwanth, primary, Daskalakis, Constantinos, additional, Ilyas, Andrew, additional, and Zampetakis, Manolis, additional
Published: 2022
Full Text: View/download PDF

32. Uniform approximations for Randomized Hadamard Transforms with applications

Author: Cherapanamjeri, Yeshwanth, primary and Nelson, Jelani, additional
Published: 2022
Full Text: View/download PDF

33. Estimation of Standard Auction Models

Author: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Center for Brains, Minds, and Machines, Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, Zampetakis, Manolis, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Center for Brains, Minds, and Machines, Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos, Ilyas, Andrew, and Zampetakis, Manolis
Published: 2022

34. Terminal Embeddings in Sublinear Time

Author: Cherapanamjeri, Yeshwanth, primary and Nelson, Jelani, additional
Published: 2022
Full Text: View/download PDF

35. A single gradient step finds adversarial examples on random two-layers neural networks

Author: Bubeck, S��bastien, Cherapanamjeri, Yeshwanth, Gidel, Gauthier, and Combes, R��mi Tachet des
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Quantitative Biology::Neurons and Cognition, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
Abstract: Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function., Added a comment about universal adversarial perturbations. 18 pages, 7 figures
Published: 2021

36. List Decodable Mean Estimation in Nearly Linear Time

Author: Cherapanamjeri, Yeshwanth, primary, Mohanty, Sidhanth, additional, and Yau, Morris, additional
Published: 2020
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

36 results on '"Cherapanamjeri, Yeshwanth"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources