54 results for "Kamath, P"
Search Results
2. The Broader Landscape of Robustness in Algorithmic Statistics
- Author
- Kamath, Gautam
- Subjects
- Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Mathematics - Statistics Theory
- Abstract
The last decade has seen a number of advances in computationally efficient algorithms for statistical methods subject to robustness constraints. An estimator may be robust in a number of different ways: to contamination of the dataset, to heavy-tailed data, or in the sense that it preserves privacy of the dataset. We survey recent results in these areas with a focus on the problem of mean estimation, drawing technical and conceptual connections between the various forms of robustness, showing that the same underlying algorithmic ideas lead to computationally efficient estimators in all these settings.
- Published
- 2024
3. Distribution Learnability and Robustness
- Author
- Ben-David, Shai, Bie, Alex, Kamath, Gautam, and Lechner, Tosca
- Subjects
- Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Statistics Theory
- Abstract
We examine the relationship between learnability and robust (or agnostic) learnability for the problem of distribution learning. We show that, contrary to other learning settings (e.g., PAC learning of function classes), realizable learnability of a class of probability distributions does not imply its agnostic learnability. We go on to examine what types of data corruption can disrupt the learnability of a distribution class, and against which types such learnability is robust. We show that realizable learnability of a class of distributions implies its robust learnability with respect to only additive corruption, but not against subtractive corruption. We also explore related implications in the context of compression schemes and differentially private learnability., Comment: In NeurIPS 2023
- Published
- 2024
4. Private Mean Estimation with Person-Level Differential Privacy
- Author
- Agarwal, Sushant, Kamath, Gautam, Majid, Mahbod, Mouzakis, Argyris, Silver, Rose, and Ullman, Jonathan
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
We study person-level differentially private (DP) mean estimation in the case where each person holds multiple samples. DP here requires the usual notion of distributional stability when $\textit{all}$ of a person's datapoints can be modified. Informally, if $n$ people each have $m$ samples from an unknown $d$-dimensional distribution with bounded $k$-th moments, we show that \[n = \tilde \Theta\left(\frac{d}{\alpha^2 m} + \frac{d}{\alpha m^{1/2} \varepsilon} + \frac{d}{\alpha^{k/(k-1)} m \varepsilon} + \frac{d}{\varepsilon}\right)\] people are necessary and sufficient to estimate the mean up to distance $\alpha$ in $\ell_2$-norm under $\varepsilon$-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate-DP and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the standard clip-and-noise framework, but the analysis for our setting requires both new algorithmic techniques and new analyses. In particular, our new bounds on the tails of sums of independent, vector-valued, bounded-moments random variables may be of interest., Comment: 72 pages, 3 figures (A hedged code sketch follows this entry.)
- Published
- 2024
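The clip-and-noise framework this abstract refers to can be illustrated with a minimal sketch, assuming a known rough center and a hand-picked clipping radius (neither of which matches the paper's actual choices): average each person's samples, clip the per-person averages to an $\ell_2$ ball so no single person can move the estimate too far, and add Gaussian noise calibrated to that bound.

```python
import numpy as np

def person_level_dp_mean(data, radius, epsilon, delta, center=None):
    """Hedged sketch of clip-and-noise person-level DP mean estimation.

    data: list of (m, d) arrays, one per person. Changing all of one
    person's samples moves the clipped average by at most 2*radius/n in
    l2, so Gaussian noise at that sensitivity gives (epsilon, delta)-DP
    by the standard Gaussian mechanism bound.
    """
    n, d = len(data), data[0].shape[1]
    center = np.zeros(d) if center is None else center
    # Each person contributes only the average of their own samples.
    person_means = np.stack([x.mean(axis=0) for x in data])
    # Clip every per-person average to an l2 ball around the center.
    offsets = person_means - center
    norms = np.linalg.norm(offsets, axis=1, keepdims=True)
    clipped = center + offsets * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    # Noise scaled to the l2 sensitivity (2*radius/n) of the average.
    sigma = 2 * radius * np.sqrt(2 * np.log(1.25 / delta)) / (n * epsilon)
    return clipped.mean(axis=0) + np.random.normal(0, sigma, size=d)

# Toy usage: 200 people with 10 samples each from a 5-dimensional Gaussian.
rng = np.random.default_rng(0)
data = [rng.normal(1.0, 1.0, size=(10, 5)) for _ in range(200)]
print(person_level_dp_mean(data, radius=2.0, epsilon=1.0, delta=1e-6))
```

The person-level point is that clipping bounds a person's total influence even when all $m$ of their samples change at once; the paper's contribution lies in the sharper analysis and optimal rates, not in this basic mechanism.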
5. Robustness Implies Privacy in Statistical Estimation
- Author
- Hopkins, Samuel B., Kamath, Gautam, Majid, Mahbod, and Narayanan, Shyam
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Statistics - Machine Learning
- Abstract
We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a nearly optimal fraction of adversarially-corrupted samples., Comment: 90 pages, 2 tables. Appeared in STOC, 2023
- Published
- 2022
6. Calibration with Privacy in Peer Review
- Author
- Ding, Wenxin, Kamath, Gautam, Wang, Weina, and Shah, Nihar B.
- Subjects
- Computer Science - Cryptography and Security, Computer Science - Information Theory
- Abstract
Reviewers in peer review are often miscalibrated: they may be strict, lenient, extreme, moderate, etc. A number of algorithms have previously been proposed to calibrate reviews. Such attempts at calibration can, however, leak sensitive information about which reviewer reviewed which paper. In this paper, we identify this problem of calibration with privacy, and provide a foundational building block to address it. Specifically, we present a theoretical study of this problem under a simplified-yet-challenging model involving two reviewers, two papers, and a MAP-computing adversary. Our main results establish the Pareto frontier of the tradeoff between privacy (preventing the adversary from inferring reviewer identity) and utility (accepting better papers), and design explicit computationally-efficient algorithms that we prove are Pareto optimal., Comment: 31 pages, 6 figures
- Published
- 2022
7. Efficient Mean Estimation with Pure Differential Privacy via a Sum-of-Squares Exponential Mechanism
- Author
- Hopkins, Samuel B., Kamath, Gautam, and Majid, Mahbod
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Statistics - Machine Learning
- Abstract
We give the first polynomial-time algorithm to estimate the mean of a $d$-variate probability distribution with bounded covariance from $\tilde{O}(d)$ independent samples subject to pure differential privacy. Prior algorithms for this problem either incur exponential running time, require $\Omega(d^{1.5})$ samples, or satisfy only the weaker concentrated or approximate differential privacy conditions. In particular, all prior polynomial-time algorithms require $d^{1+\Omega(1)}$ samples to guarantee small privacy loss with "cryptographically" high probability, $1-2^{-d^{\Omega(1)}}$, while our algorithm retains $\tilde{O}(d)$ sample complexity even in this stringent setting. Our main technique is a new approach to using the powerful Sum of Squares (SoS) method to design differentially private algorithms. "SoS proofs to algorithms" is a key theme in numerous recent works in high-dimensional algorithmic statistics -- estimators which apparently require exponential running time but whose analysis can be captured by low-degree Sum of Squares proofs can be automatically turned into polynomial-time algorithms with the same provable guarantees. We demonstrate a similar "proofs to private algorithms" phenomenon: instances of the workhorse exponential mechanism which apparently require exponential time but which can be analyzed with low-degree SoS proofs can be automatically turned into polynomial-time differentially private algorithms. We prove a meta-theorem capturing this phenomenon, which we expect to be of broad use in private algorithm design. Our techniques also draw new connections between differentially private and robust statistics in high dimensions. In particular, viewed through our proofs-to-private-algorithms lens, several well-studied SoS proofs from recent works in algorithmic robust statistics directly yield key components of our differentially private mean estimation algorithm., Comment: 66 pages, STOC 2022
- Published
- 2021
8. Robust Estimation for Random Graphs
- Author
- Acharya, Jayadev, Jain, Ayush, Kamath, Gautam, Suresh, Ananda Theertha, and Zhang, Huanyu
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
We study the problem of robustly estimating the parameter $p$ of an Erd\H{o}s-R\'enyi random graph on $n$ nodes, where a $\gamma$ fraction of nodes may be adversarially corrupted. After showing the deficiencies of canonical estimators, we design a computationally-efficient spectral algorithm which estimates $p$ up to accuracy $\tilde O(\sqrt{p(1-p)}/n + \gamma\sqrt{p(1-p)} /\sqrt{n}+ \gamma/n)$ for $\gamma < 1/60$. Furthermore, we give an inefficient algorithm with similar accuracy for all $\gamma <1/2$, the information-theoretic limit. Finally, we prove a nearly-matching statistical lower bound, showing that the error of our algorithms is optimal up to logarithmic factors.
- Published
- 2021
9. A Private and Computationally-Efficient Estimator for Unbounded Gaussians
- Author
- Kamath, Gautam, Mouzakis, Argyris, Singhal, Vikrant, Steinke, Thomas, and Ullman, Jonathan
- Subjects
- Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning
- Abstract
We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.
- Published
- 2021
10. Sketching and Sequence Alignment: A Rate-Distortion Perspective
- Author
- Shomorony, Ilan and Kamath, Govinda M.
- Subjects
- Computer Science - Information Theory
- Abstract
Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length $n$ and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require $B = (1/D) \cdot O\left( \log n \right)$ bits to achieve a distortion $D$, our proposed approach only requires $B = \log^2(1/D) \cdot O(1)$ bits. This can lead to significant computational savings in pairwise alignment estimation.
- Published
- 2021
11. The Price of Tolerance in Distribution Testing
- Author
- Canonne, Clément L., Jain, Ayush, Kamath, Gautam, and Li, Jerry
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Mathematics - Probability, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde \Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max \left\{\frac{\varepsilon_1}{\varepsilon_2^2},\left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works., Comment: Added a result on instance-optimal testing, and further discussion in the introduction
- Published
- 2021
12. Fundamental Limits of Demand-Private Coded Caching
- Author
- Gurjarpadhye, Chinmay, Ravi, Jithin, Kamath, Sneha, Dey, Bikash Kumar, and Karamchandani, Nikhil
- Subjects
- Computer Science - Information Theory
- Abstract
We consider the coded caching problem with an additional privacy constraint that a user should not get any information about the demands of the other users. We first show that a demand-private scheme for $N$ files and $K$ users can be obtained from a non-private scheme that serves only a subset of the demands for the $N$ files and $NK$ users problem. We further use this fact to construct a demand-private scheme for $N$ files and $K$ users from a particular known non-private scheme for $N$ files and $NK-K+1$ users. It is then demonstrated that the memory-rate pair $(M,\min \{N,K\}(1-M/N))$, which is achievable for non-private schemes with uncoded transmissions, is also achievable under demand privacy. We further propose a scheme that improves on these ideas by removing some redundant transmissions. The memory-rate trade-off achieved using our schemes is shown to be within a multiplicative factor of 3 from the optimal when $K < N$ and of 8 when $N\leq K$. Finally, we give the exact memory-rate trade-off for demand-private coded caching problems with $N\geq K=2$., Comment: 43 pages, 6 figures
- Published
- 2021
13. Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment
- Author
- Kamath, Govinda M., Baharav, Tavor Z., and Shomorony, Ilan
- Subjects
- Computer Science - Machine Learning, Computer Science - Information Theory, Quantitative Biology - Genomics, Statistics - Machine Learning
- Abstract
Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. State-of-the-art approaches to speed up this task use hashing to identify short segments (k-mers) that are shared by pairs of reads, which can then be used to estimate alignment scores. However, when the number of reads is large, accurately estimating alignment scores for all pairs is still very costly. Moreover, in practice, one is only interested in identifying pairs of reads with large alignment scores. In this work, we propose a new approach to pairwise alignment estimation based on two key new ingredients. The first ingredient is to cast the problem of pairwise alignment estimation under a general framework of rank-one crowdsourcing models, where the workers' responses correspond to k-mer hash collisions. These models can be accurately solved via a spectral decomposition of the response matrix. The second ingredient is to utilise a multi-armed bandit algorithm to adaptively refine this spectral estimator only for read pairs that are likely to have large alignments. The resulting algorithm iteratively performs a spectral decomposition of the response matrix for adaptively chosen subsets of the read pairs., Comment: NeurIPS 2020
- Published
- 2020
14. On the Sample Complexity of Privately Learning Unbounded High-Dimensional Gaussians
- Author
- Aden-Ali, Ishaq, Ashtiani, Hassan, and Kamath, Gautam
- Subjects
- Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning
- Abstract
We provide sample complexity upper bounds for agnostically learning multivariate Gaussians under the constraint of approximate differential privacy. These are the first finite sample upper bounds for general Gaussians which do not impose restrictions on the parameters of the distribution. Our bounds are near-optimal in the case when the covariance is known to be the identity, and conjectured to be near-optimal in the general case. From a technical standpoint, we provide analytic tools for arguing the existence of global "locally small" covers from local covers of the space. These are exploited using modifications of recent techniques for differentially private hypothesis selection. Our techniques may prove useful for privately learning other distribution classes which do not possess a finite cover.
- Published
- 2020
15. CoinPress: Practical Private Mean and Covariance Estimation
- Author
- Biswas, Sourav, Dong, Yihe, Kamath, Gautam, and Ullman, Jonathan
- Subjects
- Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Statistics Theory
- Abstract
We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically using synthetic and real-world datasets -- showing that their asymptotic error rates match the state-of-the-art theoretical bounds, and that they concretely outperform all previous methods. Specifically, previous estimators either have weak empirical accuracy at small sample sizes, perform poorly for multivariate data, or require the user to provide strong a priori estimates for the parameters., Comment: Code is available at https://github.com/twistedcubic/coin-press (A hedged code sketch follows this entry.)
- Published
- 2020
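A caricature of the iterative idea behind CoinPress, assuming roughly isotropic data, a naive even split of the privacy budget across rounds, and an ad hoc shrinkage schedule (the actual algorithm uses zCDP accounting and carefully derived radii): repeatedly clip the data to a ball around the current private estimate, release a noisy mean, and shrink the ball.

```python
import numpy as np

def iterative_private_mean(x, t=4, epsilon=1.0, delta=1e-6, r0=10.0):
    """Hedged caricature of CoinPress-style iterative private mean estimation."""
    n, d = x.shape
    center, radius = np.zeros(d), r0
    eps_r, delta_r = epsilon / t, delta / t  # naive composition, an assumption
    for _ in range(t):
        # Clip points to an l2 ball around the current (private) center.
        offsets = x - center
        norms = np.linalg.norm(offsets, axis=1, keepdims=True)
        clipped = center + offsets * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        sens = 2 * radius / n  # l2 sensitivity of the clipped average
        sigma = sens * np.sqrt(2 * np.log(1.25 / delta_r)) / eps_r
        center = clipped.mean(axis=0) + np.random.normal(0, sigma, size=d)
        # Shrink the ball: it should still contain the data (~sqrt(d) for
        # isotropic points) plus the uncertainty in the new center.
        center_err = np.sqrt(d / n) + sigma * np.sqrt(d)
        radius = 2.0 * np.sqrt(d) + 3.0 * center_err
    return center

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=(4000, 8))
print(iterative_private_mean(x))  # roughly (3, ..., 3)
```

The gain over a single clip-and-noise pass is that later rounds clip around a much better center, so far less signal is thrown away for the same privacy budget.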
16. A Primer on Private Statistics
- Author
- Kamath, Gautam and Ullman, Jonathan
- Subjects
- Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning
- Abstract
Differentially private statistical estimation has seen a flurry of developments over the last several years. Study has been divided into two schools of thought, focusing on empirical statistics versus population statistics. We suggest that these two lines of work are more similar than different by giving examples of methods that were initially framed for empirical statistics, but can be applied just as well to population statistics. We also provide a thorough coverage of recent work in this area., Comment: 20 pages. Comments welcome
- Published
- 2020
17. Private Mean Estimation of Heavy-Tailed Distributions
- Author
- Kamath, Gautam, Singhal, Vikrant, and Ullman, Jonathan
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$-th moments. Roughly speaking, in the univariate case, we show that $n = \Theta\left(\frac{1}{\alpha^2} + \frac{1}{\alpha^{\frac{k}{k-1}}\varepsilon}\right)$ samples are necessary and sufficient to estimate the mean to $\alpha$-accuracy under $\varepsilon$-differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case., Comment: Appeared in COLT 2020 (A hedged code sketch follows this entry.)
- Published
- 2020
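In the univariate case, the flavor of the upper bound can be sketched as clip-and-noise with a radius that balances clipping bias (which falls off like $R^{-(k-1)}$ under a bounded $k$-th moment) against Laplace noise (which grows like $R/(n\varepsilon)$). The radius choice below is a back-of-the-envelope assumption, not the paper's exact estimator.

```python
import numpy as np

def dp_heavy_tailed_mean(x, epsilon, k=2, moment_bound=1.0):
    """Hedged sketch: clip-and-noise mean estimation for heavy-tailed data.

    Assumes the k-th moment is at most moment_bound. Balancing the
    clipping bias ~ moment_bound / R**(k-1) against the noise scale
    ~ R / (n * epsilon) suggests R ~ (moment_bound * n * epsilon)**(1/k).
    """
    n = len(x)
    radius = (moment_bound * n * epsilon) ** (1.0 / k)
    clipped = np.clip(x, -radius, radius)
    # One sample moves the sum by <= 2*radius, so the average has
    # sensitivity 2*radius/n; Laplace noise at that scale is epsilon-DP.
    return clipped.mean() + np.random.laplace(scale=2 * radius / (n * epsilon))

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=5000)  # heavy-tailed sample, true mean 0
print(dp_heavy_tailed_mean(x, epsilon=1.0, k=2))
```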
18. Locally Private Hypothesis Selection
- Author
- Gopi, Sivakanth, Kamath, Gautam, Kulkarni, Janardhan, Nikolov, Aleksandar, Wu, Zhiwei Steven, and Zhang, Huanyu
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
We initiate the study of hypothesis selection under local differential privacy. Given samples from an unknown probability distribution $p$ and a set of $k$ probability distributions $\mathcal{Q}$, we aim to output, under the constraints of $\varepsilon$-local differential privacy, a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. This is a generalization of the classic problem of $k$-wise simple hypothesis testing, which corresponds to when $p \in \mathcal{Q}$, and we wish to identify $p$. Absent privacy constraints, this problem requires $O(\log k)$ samples from $p$, and it was recently shown that the same complexity is achievable under (central) differential privacy. However, the naive approach to this problem under local differential privacy would require $\tilde O(k^2)$ samples. We first show that the constraint of local differential privacy incurs an exponential increase in cost: any algorithm for this problem requires at least $\Omega(k)$ samples. Second, for the special case of $k$-wise simple hypothesis testing, we provide a non-interactive algorithm which nearly matches this bound, requiring $\tilde O(k)$ samples. Finally, we provide sequentially interactive algorithms for the general case, requiring $\tilde O(k)$ samples and only $O(\log \log k)$ rounds of interactivity. Our algorithms are achieved through a reduction to maximum selection with adversarial comparators, a problem of independent interest for which we initiate study in the parallel setting. For this problem, we provide a family of algorithms for each number of allowed rounds of interaction $t$, as well as lower bounds showing that they are near-optimal for every $t$. Notably, our algorithms result in exponential improvements on the round complexity of previous methods., Comment: To appear in COLT 2020
- Published
- 2020
19. Lower Bounds for Compressed Sensing with Generative Models
- Author
- Kamath, Akshay, Karmalkar, Sushrut, and Price, Eric
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning
- Abstract
The goal of compressed sensing is to learn a structured signal $x$ from a limited number of noisy linear measurements $y \approx Ax$. In traditional compressed sensing, "structure" is represented by sparsity in some known basis. Inspired by the success of deep learning in modeling images, recent work starting with~\cite{BJPD17} has instead considered structure to come from a generative model $G: \mathbb{R}^k \to \mathbb{R}^n$. We present two results establishing the difficulty of this latter task, showing that existing bounds are tight. First, we provide a lower bound matching the~\cite{BJPD17} upper bound for compressed sensing from $L$-Lipschitz generative models $G$. In particular, there exists such a function that requires roughly $\Omega(k \log L)$ linear measurements for sparse recovery to be possible. This holds even for the more relaxed goal of \emph{nonuniform} recovery. Second, we show that generative models generalize sparsity as a representation of structure. In particular, we construct a ReLU-based neural network $G: \mathbb{R}^{2k} \to \mathbb{R}^n$ with $O(1)$ layers and $O(kn)$ activations per layer, such that the range of $G$ contains all $k$-sparse vectors.
- Published
- 2019
20. Random Restrictions of High-Dimensional Distributions and Uniformity Testing with Subcube Conditioning
- Author
- Canonne, Clément L., Chen, Xi, Kamath, Gautam, Levi, Amit, and Waingarten, Erik
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Probability, Mathematics - Statistics Theory
- Abstract
We give a nearly-optimal algorithm for testing uniformity of distributions supported on $\{-1,1\}^n$, which makes $\tilde O (\sqrt{n}/\varepsilon^2)$ queries to a subcube conditional sampling oracle (Bhattacharyya and Chakraborty (2018)). The key technical component is a natural notion of random restriction for distributions on $\{-1,1\}^n$, and a quantitative analysis of how such a restriction affects the mean vector of the distribution. Along the way, we consider the problem of mean testing with independent samples and provide a nearly-optimal algorithm., Comment: Added Remark 4.4, which discusses the time complexity (the algorithms are polynomial-time, based on an observation from [CJLW20]); removing log log log n factor for the Gaussian testing algorithm. These changes reflect those included in the conference version (SODA'21)
- Published
- 2019
21. Demand-Private Coded Caching and the Exact Trade-off for N=K=2
- Author
- Kamath, Sneha, Ravi, Jithin, and Dey, Bikash Kumar
- Subjects
- Computer Science - Information Theory
- Abstract
The distributed coded caching problem has been studied extensively in the recent past. While the known coded caching schemes achieve an improved transmission rate, they violate the privacy of the users since in these schemes the demand of one user is revealed to others in the delivery phase. In this paper, we consider the coded caching problem under the constraint that the demands of the other users remain information theoretically secret from each user. We first show that the memory-rate pair $(M,\min \{N,K\}(1-M/N))$ is achievable under information theoretic demand privacy, while using broadcast transmissions. We then show that a demand-private scheme for $N$ files and $K$ users can be obtained from a non-private scheme that satisfies only a restricted subset of demands of $NK$ users for $N$ files. We then focus on the demand-private coded caching problem for $K=2$ users, $N=2$ files. We characterize the exact memory-rate trade-off for this case. To show the achievability, we use our first result to construct a demand-private scheme from a non-private scheme satisfying a restricted demand subset that is known from an earlier work by Tian. Further, by giving a converse based on the extra requirement of privacy, we show that the obtained achievable region is the exact memory-rate trade-off., Comment: 8 pages, 2 figures
- Published
- 2019
22. Differentially Private Algorithms for Learning Mixtures of Separated Gaussians
- Author
- Kamath, Gautam, Sheffet, Or, Singhal, Vikrant, and Ullman, Jonathan
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Learning the parameters of Gaussian mixture models is a fundamental and widely studied problem with numerous applications. In this work, we give new algorithms for learning the parameters of a high-dimensional, well separated, Gaussian mixture model subject to the strong constraint of differential privacy. In particular, we give a differentially private analogue of the algorithm of Achlioptas and McSherry. Our algorithm has two key properties not achieved by prior work: (1) The algorithm's sample complexity matches that of the corresponding non-private algorithm up to lower order terms in a wide range of parameters. (2) The algorithm does not require strong a priori bounds on the parameters of the mixture components., Comment: To appear in NeurIPS 2019
- Published
- 2019
23. Demand Private Coded Caching
- Author
- Kamath, Sneha
- Subjects
- Computer Science - Information Theory
- Abstract
The work by Maddah-Ali and Niesen demonstrated the benefits in reducing the transmission rate in a noiseless broadcast network by joint design of caching and delivery schemes. In their setup, each user learns the demands of all other users in the delivery phase. In this paper, we introduce the problem of demand private coded caching where we impose a privacy requirement that no user learns any information about the demands of other users. We provide an achievable scheme and compare its performance using the existing lower bounds on the achievable rates under no privacy setting. For this setting, when $N\leq K$ we show that our scheme is order optimal within a multiplicative factor of 8. Furthermore, when $N > K$ and $M\geq N/K$, our scheme is order optimal within a multiplicative factor of 4., Comment: 14 pages, 3 figures
- Published
- 2019
24. Optimal Information Updating based on Value of Information
- Author
- Singh, Rahul, Kamath, Gopal Krishna, and Kumar, P. R.
- Subjects
- Computer Science - Information Theory, Computer Science - Networking and Internet Architecture
- Abstract
We address the problem of how to optimally schedule data packets over an unreliable channel in order to minimize the estimation error of a simple-to-implement remote linear estimator using a constant "Kalman" gain to track the state of a Gauss Markov process. The remote estimator receives time-stamped data packets which contain noisy observations of the process. Additionally, they also contain the information about the "quality" of the sensor source, i.e., the variance of the observation noise that was used to generate the packet. In order to minimize the estimation error, the scheduler needs to use both while prioritizing packet transmissions. It is shown that a simple index rule that calculates the value of information (VoI) of each packet, and then schedules the packet with the largest current value of VoI, is optimal. The VoI of a packet decreases with its age, and increases with the precision of the source. Thus, we conclude that, for constant filter gains, a policy which minimizes the age of information does not necessarily maximize the estimator performance., Comment: Accepted in Allerton 2019 (A hedged code sketch follows this entry.)
- Published
- 2019
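The index rule itself is a one-liner: score every queued packet and transmit the arg-max. The `voi_index` formula below is a plausible proxy, assumed purely for illustration (it discounts by age through the AR(1) coefficient and rewards precise sources); the paper derives the exact index for its estimator model.

```python
import numpy as np

def voi_index(age, obs_var, a=0.9, process_var=1.0):
    """Assumed VoI proxy: an observation from `age` slots ago relates to the
    current state through a**age, and is worth more when its observation
    noise obs_var is small. Decreasing in age, increasing in precision."""
    stale_signal = (a ** (2 * age)) * process_var
    return stale_signal / (obs_var + stale_signal)

def schedule(packets):
    """Index rule: transmit the queued packet with the largest current VoI.
    packets: list of (age_in_slots, observation_noise_variance) pairs."""
    return int(np.argmax([voi_index(age, r) for age, r in packets]))

# An old-but-precise packet can beat a fresh-but-noisy one.
packets = [(5, 0.1), (1, 1.0), (2, 0.3)]
print("transmit packet", schedule(packets))  # picks index 0 here
```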
25. Private Identity Testing for High-Dimensional Distributions
- Author
- Canonne, Clément L., Kamath, Gautam, McMillan, Audra, Ullman, Jonathan, and Zakynthinou, Lydia
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions., Comment: Discussing a mistake in the proof of one of the algorithms (Theorem 1.2, computationally inefficient tester), and pointing to follow-up work by Narayanan (2022) who improves upon our results and fixes this mistake
- Published
- 2019
26. The Structure of Optimal Private Tests for Simple Hypotheses
- Author
- Canonne, Clément L., Kamath, Gautam, McMillan, Audra, Smith, Adam, and Ullman, Jonathan
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Hypothesis testing plays a central role in statistical inference, and is used in many settings where privacy concerns are paramount. This work answers a basic question about privately testing simple hypotheses: given two distributions $P$ and $Q$, and a privacy level $\varepsilon$, how many i.i.d. samples are needed to distinguish $P$ from $Q$ subject to $\varepsilon$-differential privacy, and what sort of tests have optimal sample complexity? Specifically, we characterize this sample complexity up to constant factors in terms of the structure of $P$ and $Q$ and the privacy level $\varepsilon$, and show that this sample complexity is achieved by a certain randomized and clamped variant of the log-likelihood ratio test. Our result is an analogue of the classical Neyman-Pearson lemma in the setting of private hypothesis testing. We also give an application of our result to private change-point detection. Our characterization applies more generally to hypothesis tests satisfying essentially any notion of algorithmic stability, which is known to imply strong generalization bounds in adaptive data analysis, and thus our results have applications even when privacy is not a primary concern., Comment: To appear in STOC 2019 (A hedged code sketch follows this entry.)
- Published
- 2018
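A minimal sketch of a "randomized and clamped" log-likelihood ratio test in the spirit of this abstract, with assumed simplifications (symmetric clamping at $\pm\varepsilon$, a zero threshold, and Laplace noise rather than the paper's exact randomization): clamping each sample's LLR bounds the statistic's sensitivity, so modest noise suffices for $\varepsilon$-DP.

```python
import numpy as np

def private_clamped_llr_test(samples, p, q, epsilon, threshold=0.0):
    """Hedged sketch of a clamped, noisy log-likelihood ratio test.

    p, q: probability vectors over a finite domain; samples: indices.
    Each clamped per-sample LLR lies in [-epsilon, epsilon], so changing
    one sample moves the sum by at most 2*epsilon; Laplace noise with
    scale (2*epsilon)/epsilon = 2 then makes the statistic epsilon-DP.
    """
    llr = np.log(p[samples]) - np.log(q[samples])
    stat = np.clip(llr, -epsilon, epsilon).sum() + np.random.laplace(scale=2.0)
    return "accept P" if stat > threshold else "accept Q"

rng = np.random.default_rng(2)
p, q = np.array([0.8, 0.2]), np.array([0.2, 0.8])
samples = rng.choice(2, size=500, p=p)  # data truly drawn from p
print(private_clamped_llr_test(samples, p, q, epsilon=0.5))
```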
27. An Operational Approach to Information Leakage
- Author
- Issa, Ibrahim, Wagner, Aaron B., and Kamath, Sudeep
- Subjects
- Computer Science - Information Theory
- Abstract
Given two random variables $X$ and $Y$, an operational approach is undertaken to quantify the "leakage" of information from $X$ to $Y$. The resulting measure $\mathcal{L}(X \!\! \to \!\! Y)$ is called \emph{maximal leakage}, and is defined as the multiplicative increase, upon observing $Y$, of the probability of correctly guessing a randomized function of $X$, maximized over all such randomized functions. A closed-form expression for $\mathcal{L}(X \!\! \to \!\! Y)$ is given for discrete $X$ and $Y$, and it is subsequently generalized to handle a large class of random variables. The resulting properties are shown to be consistent with an axiomatic view of a leakage measure, and the definition is shown to be robust to variations in the setup. Moreover, a variant of the Shannon cipher system is studied, in which performance of an encryption scheme is measured using maximal leakage. A single-letter characterization of the optimal limit of (normalized) maximal leakage is derived and asymptotically-optimal encryption schemes are demonstrated. Furthermore, the sample complexity of estimating maximal leakage from data is characterized up to subpolynomial factors. Finally, the \emph{guessing} framework used to define maximal leakage is used to give operational interpretations of commonly used leakage measures, such as Shannon capacity, maximal correlation, and local differential privacy., Comment: Submitted to IEEE Transactions on Information Theory (appeared in part in CISS 2016, ISIT 2016 & 2017) (A hedged code sketch follows this entry.)
- Published
- 2018
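For discrete $X$ and $Y$ the closed form from this line of work is $\mathcal{L}(X \to Y) = \log \sum_{y} \max_{x\,:\,P_X(x)>0} P_{Y|X}(y|x)$, which is immediate to evaluate from a joint pmf; a small sketch:

```python
import numpy as np

def maximal_leakage(joint):
    """Maximal leakage L(X -> Y) in nats for a discrete joint pmf.

    joint[i, j] = P(X = i, Y = j). Implements the closed form
    log sum_y max_{x : P(x) > 0} P(y|x).
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)                      # marginal of X
    support = px > 0
    cond = joint[support] / px[support, None]   # P(y|x) on X's support
    return np.log(cond.max(axis=0).sum())

# Binary symmetric channel with crossover 0.1 and uniform input:
bsc = 0.5 * np.array([[0.9, 0.1],
                      [0.1, 0.9]])
print(maximal_leakage(bsc))  # log(0.9 + 0.9) = log 1.8, about 0.588 nats
```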
28. Anaconda: A Non-Adaptive Conditional Sampling Algorithm for Distribution Testing
- Author
- Kamath, Gautam and Tzamos, Christos
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Probability, Mathematics - Statistics Theory
- Abstract
We investigate distribution testing with access to non-adaptive conditional samples. In the conditional sampling model, the algorithm is given the following access to a distribution: it submits a query set $S$ to an oracle, which returns a sample from the distribution conditioned on being from $S$. In the non-adaptive setting, all query sets must be specified in advance of viewing the outcomes. Our main result is the first polylogarithmic-query algorithm for equivalence testing, deciding whether two unknown distributions are equal to or far from each other. This is an exponential improvement over the previous best upper bound, and demonstrates that the complexity of the problem in this model is intermediate between the complexity of the problem in the standard sampling model and in the adaptive conditional sampling model. We also significantly improve the sample complexity for the easier problems of uniformity and identity testing. For the former, our algorithm requires only $\tilde O(\log n)$ queries, matching the information-theoretic lower bound up to a $O(\log \log n)$-factor. Our algorithm works by reducing the problem from $\ell_1$-testing to $\ell_\infty$-testing, which enjoys a much cheaper sample complexity. Necessitated by the limited power of the non-adaptive model, our algorithm is very simple to state. However, there are significant challenges in the analysis, due to the complex structure of how two arbitrary distributions may differ., Comment: SODA 2019
- Published
- 2018
29. Bandit-Based Monte Carlo Optimization for Nearest Neighbors
- Author
- Bagaria, Vivek, Baharav, Tavor Z., Kamath, Govinda M., and Tse, David N.
- Subjects
- Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Statistics - Computation, Statistics - Machine Learning
- Abstract
The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem which is then solved via multi-armed bandits. We apply this technique to solve the problem of high-dimensional $k$-nearest neighbors, developing an algorithm which we prove is able to identify exact nearest neighbors with high probability. We show that under regularity assumptions on a dataset of $n$ points in $d$-dimensional space, the complexity of our algorithm scales logarithmically with the dimension of the data as $O\left((n+d)\log^2 \left(\frac{nd}{\delta}\right)\right)$ for error probability $\delta$, rather than linearly as in exact computation requiring $O(nd)$. We corroborate our theoretical results with numerical simulations, showing that our algorithm outperforms both exact computation and state-of-the-art algorithms such as kGraph, NGT, and LSH on real datasets., Comment: Accepted to the IEEE Journal on Selected Areas in Information Theory (JSAIT) - Special Issue on Sequential, Active, and Reinforcement Learning
- Published
- 2018
30. INSPECTRE: Privately Estimating the Unseen
- Author
- Acharya, Jayadev, Kamath, Gautam, Sun, Ziteng, and Zhang, Huanyu
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Learning, Mathematics - Statistics Theory
- Abstract
We develop differentially private methods for estimating various distributional properties. Given a sample from a discrete distribution $p$, some functional $f$, and accuracy and privacy parameters $\alpha$ and $\varepsilon$, the goal is to estimate $f(p)$ up to accuracy $\alpha$, while maintaining $\varepsilon$-differential privacy of the sample. We prove almost-tight bounds on the sample size required for this problem for several functionals of interest, including support size, support coverage, and entropy. We show that the cost of privacy is negligible in a variety of settings, both theoretically and experimentally. Our methods are based on a sensitivity analysis of several state-of-the-art methods for estimating these properties with sublinear sample complexities.
- Published
- 2018
31. Medoids in almost linear time via multi-armed bandits
- Author
- Bagaria, Vivek, Kamath, Govinda M., Ntranos, Vasilis, Zhang, Martin J., and Tse, David
- Subjects
- Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Learning
- Abstract
Computing the medoid of a large number of points in high-dimensional space is an increasingly common operation in many data science problems. We present an algorithm Med-dit which uses O(n log n) distance evaluations to compute the medoid with high probability. Med-dit is based on a connection with the multi-armed bandit problem. We evaluate the performance of Med-dit empirically on the Netflix-prize and the single-cell RNA-Seq datasets, containing hundreds of thousands of points living in tens of thousands of dimensions, and observe a 5-10x improvement in performance over the current state of the art. Med-dit is available at https://github.com/bagavi/Meddit (A hedged code sketch follows this entry.)
- Published
- 2017
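The bandit connection can be sketched as follows, with assumed confidence radii and sampling rule (Med-dit's actual parameters and analysis are in the paper and the linked repository): each point is an arm whose unknown mean is its average distance to the others, one "pull" evaluates a single random distance, and confidence bounds concentrate the evaluation budget on plausible medoids.

```python
import numpy as np

def medoid_bandit(points, delta=0.01, budget=50000):
    """Hedged UCB-style sketch of bandit medoid search (Med-dit flavored).

    A pull is one random distance evaluation for an arm (point). We keep
    pulling the arm with the smallest lower confidence bound and stop
    when some arm's upper bound beats every other arm's lower bound.
    The confidence radius assumes distances lie in a bounded range.
    """
    rng = np.random.default_rng(0)
    n = len(points)
    counts, sums = np.zeros(n), np.zeros(n)

    def pull(i):
        j = rng.integers(n - 1)
        j += j >= i  # uniform over the points other than i
        counts[i] += 1
        sums[i] += np.linalg.norm(points[i] - points[j])

    for i in range(n):  # initialize every arm once
        pull(i)
    for _ in range(budget):
        means = sums / counts
        rad = np.sqrt(2 * np.log(1 / delta) / counts)
        lcb, ucb = means - rad, means + rad
        best = int(np.argmin(lcb))
        rest = np.delete(np.arange(n), best)
        if ucb[best] <= lcb[rest].min():
            break  # best is, within confidence, the medoid
        pull(best)
    return int(np.argmin(sums / counts))

pts = np.random.default_rng(3).normal(size=(300, 20))
print("approximate medoid index:", medoid_bandit(pts))
```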
32. Dimension Reduction for Polynomials over Gaussian Space and Applications
- Author
- Ghazi, Badih, Kamath, Pritish, and Raghavendra, Prasad
- Subjects
- Computer Science - Computational Complexity, Computer Science - Information Theory
- Abstract
We introduce a new technique for reducing the dimension of the ambient space of low-degree polynomials in the Gaussian space while preserving their relative correlation structure, analogous to the Johnson-Lindenstrauss lemma. As applications, we address the following problems: 1. Computability of Approximately Optimal Noise Stable function over Gaussian space: The goal is to find a partition of $\mathbb{R}^n$ into $k$ parts that maximizes the noise stability. A $\delta$-optimal partition is one which is within additive $\delta$ of the optimal noise stability. De, Mossel & Neeman (CCC 2017) raised the question of proving a computable bound on the dimension $n_0(\delta)$ in which we can find a $\delta$-optimal partition. While De et al. provide such a bound, using our new technique, we obtain improved explicit bounds on the dimension $n_0(\delta)$. 2. Decidability of Non-Interactive Simulation of Joint Distributions: A "non-interactive simulation" problem is specified by two distributions $P(x,y)$ and $Q(u,v)$: The goal is to determine if two players that observe sequences $X^n$ and $Y^n$ respectively where $\{(X_i, Y_i)\}_{i=1}^n$ are drawn i.i.d. from $P(x,y)$ can generate pairs $U$ and $V$ respectively (without communicating with each other) with a joint distribution that is arbitrarily close in total variation to $Q(u,v)$. Even when $P$ and $Q$ are extremely simple, it is open in several cases if $P$ can simulate $Q$. In the special case where $Q$ is a joint distribution over $\{0,1\} \times \{0,1\}$, Ghazi, Kamath and Sudan (FOCS 2016) proved a computable bound on the number of samples $n_0(\delta)$ that can be drawn from $P(x,y)$ to get $\delta$-close to $Q$ (if it is possible at all). Recently De, Mossel & Neeman obtained such bounds when $Q$ is a distribution over $[k] \times [k]$ for any $k \ge 2$. We recover this result with improved explicit bounds on $n_0(\delta)$.
- Published
- 2017
33. Which Distribution Distances are Sublinearly Testable?
- Author
- Daskalakis, Constantinos, Kamath, Gautam, and Wright, John
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Learning, Mathematics - Statistics Theory
- Abstract
Given samples from an unknown distribution $p$ and a description of a distribution $q$, are $p$ and $q$ close or far? This question of "identity testing" has received significant attention in the case of testing whether $p$ and $q$ are equal or far in total variation distance. However, in recent work, the following questions have been critical to solving problems at the frontiers of distribution testing: -Alternative Distances: Can we test whether $p$ and $q$ are far in other distances, say Hellinger? -Tolerance: Can we test when $p$ and $q$ are close, rather than equal? And if so, close in which distances? Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, $\ell_2$, Hellinger, Kullback-Leibler, and $\chi^2$. For each pair of distances $d_1$ and $d_2$, we study the complexity of testing if $p$ and $q$ are close in $d_1$ versus far in $d_2$, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity $O(n^{1 - \gamma})$ for some $\gamma > 0$ where $n$ is the size of the support of the distributions $p$ and $q$). We provide matching upper and lower bounds for each case. We also study these questions in the case where we only have samples from $q$ (equivalence testing), showing qualitative differences from identity testing in terms of when tolerance can be achieved. Our algorithms fall into the classical paradigm of $\chi^2$-statistics, but require crucial changes to handle the challenges introduced by each distance we consider. Finally, we survey other recent results in an attempt to serve as a reference for the complexity of various distribution testing problems., Comment: To appear in SODA 2018
- Published
- 2017
34. Improved Bounds for Universal One-Bit Compressive Sensing
- Author
- Acharya, Jayadev, Bhattacharyya, Arnab, and Kamath, Pritish
- Subjects
- Computer Science - Information Theory
- Abstract
Unlike compressive sensing where the measurement outputs are assumed to be real-valued and have infinite precision, in "one-bit compressive sensing", measurements are quantized to one bit, their signs. In this work, we show how to recover the support of sparse high-dimensional vectors in the one-bit compressive sensing framework with an asymptotically near-optimal number of measurements. We also improve the bounds on the number of measurements for approximately recovering vectors from one-bit compressive sensing measurements. Our results are universal, namely the same measurement scheme works simultaneously for all sparse vectors. Our proof of optimality for support recovery is obtained by showing an equivalence between the task of support recovery using 1-bit compressive sensing and a well-studied combinatorial object known as Union Free Families., Comment: 14 pages
- Published
- 2017
35. Robustly Learning a Gaussian: Getting Optimal Error, Efficiently
- Author
- Diakonikolas, Ilias, Kamath, Gautam, Kane, Daniel M., Li, Jerry, Moitra, Ankur, and Stewart, Alistair
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Learning, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise -- where an $\varepsilon$-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error $O(\varepsilon)$ in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of $\sqrt{2}$ and the running time is polynomial in $d$ and $1/\varepsilon$. When both the mean and covariance are unknown, the running time is polynomial in $d$ and quasipolynomial in $1/\varepsilon$. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings., Comment: To appear in SODA 2018
- Published
- 2017
36. Priv'IT: Private and Sample Efficient Identity Testing
- Author
- Cai, Bryan, Daskalakis, Constantinos, and Kamath, Gautam
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Cryptography and Security, Computer Science - Information Theory, Computer Science - Learning, Mathematics - Statistics Theory
- Abstract
We develop differentially private hypothesis testing methods for the small sample regime. Given a sample $\cal D$ from a categorical distribution $p$ over some domain $\Sigma$, an explicitly described distribution $q$ over $\Sigma$, some privacy parameter $\varepsilon$, accuracy parameter $\alpha$, and requirements $\beta_{\rm I}$ and $\beta_{\rm II}$ for the type I and type II errors of our test, the goal is to distinguish between $p=q$ and $d_{\rm{TV}}(p,q) \geq \alpha$. We provide theoretical bounds for the sample size $|{\cal D}|$ so that our method both satisfies $(\varepsilon,0)$-differential privacy, and guarantees $\beta_{\rm I}$ and $\beta_{\rm II}$ type I and type II errors. We show that differential privacy may come for free in some regimes of parameters, and we always beat the sample complexity resulting from running the $\chi^2$-test with noisy counts, or standard approaches such as repetition for endowing non-private $\chi^2$-style statistics with differential privacy guarantees. We experimentally compare the sample complexity of our method to that of recently proposed methods for private hypothesis testing., Comment: To appear in ICML 2017
- Published
- 2017
37. Being Robust (in High Dimensions) Can Be Practical
- Author
- Diakonikolas, Ilias, Kamath, Gautam, Kane, Daniel M., Li, Jerry, Moitra, Ankur, and Stewart, Alistair
- Subjects
- Computer Science - Learning, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Statistics - Machine Learning
- Abstract
Robust estimation is much more challenging in high dimensions than it is in one dimension: Most techniques either lead to intractable optimization problems or estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal, up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and suddenly make high-dimensional robust estimation a realistic possibility., Comment: Appeared in ICML 2017
- Published
- 2017
38. Testing Ising Models
- Author
- Daskalakis, Constantinos, Dikkala, Nishanth, and Kamath, Gautam
- Subjects
- Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Probability, Mathematics - Statistics Theory
- Abstract
Given samples from an unknown multivariate distribution $p$, is it possible to distinguish whether $p$ is the product of its marginals versus $p$ being far from every product distribution? Similarly, is it possible to distinguish whether $p$ equals a given distribution $q$ versus $p$ and $q$ being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity. Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample and time efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model., Comment: Appeared SODA 2018. Final version to appear in IEEE Transactions on Information Theory
- Published
- 2016
39. Optimality of Correlated Sampling Strategies
- Author
- Bavarian, Mohammad, Ghazi, Badih, Haramaty, Elad, Kamath, Pritish, Rivest, Ronald L., and Sudan, Madhu
- Subjects
- Computer Science - Computational Complexity, Computer Science - Information Theory
- Abstract
In the "correlated sampling" problem, two players are given probability distributions $P$ and $Q$, respectively, over the same finite set, with access to shared randomness. Without any communication, the two players are each required to output an element sampled according to their respective distributions, while trying to minimize the probability that their outputs disagree. A well known strategy due to Kleinberg-Tardos and Holenstein, with a close variant (for a similar problem) due to Broder, solves this task with disagreement probability at most $2 \delta/(1+\delta)$, where $\delta$ is the total variation distance between $P$ and $Q$. This strategy has been used in several different contexts, including sketching algorithms, approximation algorithms based on rounding linear programming relaxations, the study of parallel repetition and cryptography. In this paper, we give a surprisingly simple proof that this strategy is essentially optimal. Specifically, for every $\delta \in (0,1)$, we show that any correlated sampling strategy incurs a disagreement probability of essentially $2\delta/(1+\delta)$ on some inputs $P$ and $Q$ with total variation distance at most $\delta$. This partially answers a recent question of Rivest. Our proof is based on studying a new problem that we call "constrained agreement". Here, the two players are given subsets $A \subseteq [n]$ and $B \subseteq [n]$, respectively, and their goal is to output an element $i \in A$ and $j \in B$, respectively, while minimizing the probability that $i \neq j$. We prove tight bounds for this question, which in turn imply tight bounds for correlated sampling. Though we settle basic questions about the two problems, our formulation leads to more fine-grained questions that remain open., Comment: 12 pages; Improved presentation (again) based on feedback from anonymous ToC reviewers
- Published
- 2016
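The strategy whose optimality the paper establishes is short enough to state directly. With shared randomness, both players scan one common sequence of (index, height) pairs, with index uniform over the domain and height uniform in $[0,1]$, and each outputs the first index that falls under its own distribution's bar. A sketch, assuming a small finite domain:

```python
import numpy as np

def correlated_sample(dist, shared):
    """One player's side of the Kleinberg-Tardos / Holenstein strategy.

    Rejection sampling against a shared sequence: each output has exactly
    the right marginal, and the two players' outputs disagree with
    probability at most 2*d_TV(P, Q) / (1 + d_TV(P, Q)).
    """
    for idx, height in shared:
        if height < dist[idx]:
            return idx
    raise RuntimeError("shared sequence exhausted; use a longer one")

rng = np.random.default_rng(4)
P = np.array([0.4, 0.3, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.2, 0.2])  # d_TV(P, Q) = 0.1, so the bound is ~0.182
trials, disagree = 5000, 0
for _ in range(trials):
    shared = [(rng.integers(4), rng.random()) for _ in range(200)]
    disagree += correlated_sample(P, shared) != correlated_sample(Q, shared)
print("empirical disagreement:", disagree / trials)
```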
40. Decidability of Non-Interactive Simulation of Joint Distributions
- Author
- Ghazi, Badih, Kamath, Pritish, and Sudan, Madhu
- Subjects
- Computer Science - Information Theory, Computer Science - Computational Complexity
- Abstract
We present decidability results for a sub-class of "non-interactive" simulation problems, a well-studied class of problems in information theory. A non-interactive simulation problem is specified by two distributions $P(x,y)$ and $Q(u,v)$: The goal is to determine if two players, Alice and Bob, that observe sequences $X^n$ and $Y^n$ respectively where $\{(X_i, Y_i)\}_{i=1}^n$ are drawn i.i.d. from $P(x,y)$ can generate pairs $U$ and $V$ respectively (without communicating with each other) with a joint distribution that is arbitrarily close in total variation to $Q(u,v)$. Even when $P$ and $Q$ are extremely simple: e.g., $P$ is uniform on the triples $\{(0,0), (0,1), (1,0)\}$ and $Q$ is a "doubly symmetric binary source", i.e., $U$ and $V$ are uniform $\pm 1$ variables with correlation say $0.49$, it is open if $P$ can simulate $Q$. In this work, we show that whenever $P$ is a distribution on a finite domain and $Q$ is a $2 \times 2$ distribution, then the non-interactive simulation problem is decidable: specifically, given $\delta > 0$ the algorithm runs in time bounded by some function of $P$ and $\delta$ and either gives a non-interactive simulation protocol that is $\delta$-close to $Q$ or asserts that no protocol gets $O(\delta)$-close to $Q$. The main challenge to such a result is determining explicit (computable) convergence bounds on the number $n$ of samples that need to be drawn from $P(x,y)$ to get $\delta$-close to $Q$. We invoke contemporary results from the analysis of Boolean functions such as the invariance principle and a regularity lemma to obtain such explicit bounds.
- Published
- 2016
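For concreteness, here is a hedged Monte Carlo sketch of the simulation task in the entry above: each player applies a Boolean function (majority, an arbitrary candidate protocol, not one from the paper) to its half of $n$ i.i.d. samples from the source uniform on $\{(0,0),(0,1),(1,0)\}$, and we estimate how far the resulting pair $(U, V)$ is in total variation from a doubly symmetric binary source with correlation $0.49$. All function and parameter choices are illustrative.

    import random

    def sample_P(rng, n):
        # n i.i.d. draws from P, uniform on the three pairs below.
        pairs = [(0, 0), (0, 1), (1, 0)]
        xs, ys = zip(*(rng.choice(pairs) for _ in range(n)))
        return xs, ys

    def empirical_law(f, g, n, trials, rng):
        counts = {(u, v): 0 for u in (0, 1) for v in (0, 1)}
        for _ in range(trials):
            xs, ys = sample_P(rng, n)
            counts[(f(xs), g(ys))] += 1
        return {uv: c / trials for uv, c in counts.items()}

    def tv_to_dsbs(law, rho):
        # DSBS(rho) as a law on {0,1}^2: agreeing pairs have mass (1+rho)/4 each.
        q = {(0, 0): (1 + rho) / 4, (1, 1): (1 + rho) / 4,
             (0, 1): (1 - rho) / 4, (1, 0): (1 - rho) / 4}
        return 0.5 * sum(abs(law[uv] - q[uv]) for uv in q)

    rng = random.Random(0)
    maj = lambda bits: int(2 * sum(bits) > len(bits))  # arbitrary candidate protocol
    law = empirical_law(maj, maj, n=51, trials=20000, rng=rng)
    print("TV to DSBS(0.49):", tv_to_dsbs(law, 0.49))

Whether any choice of functions drives this distance to zero is exactly the open question the entry mentions; the decidability result bounds how large $n$ must be searched.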
41. Partial DNA Assembly: A Rate-Distortion Perspective
- Author
-
Shomorony, Ilan, Kamath, Govinda M., Xia, Fei, Courtade, Thomas A., and Tse, David N.
- Subjects
Computer Science - Information Theory ,Quantitative Biology - Genomics - Abstract
Earlier formulations of the DNA assembly problem were all in the context of perfect assembly; i.e., given a set of reads from a long genome sequence, is it possible to perfectly reconstruct the original sequence? In practice, however, it is very often the case that the read data is not sufficiently rich to permit unambiguous reconstruction of the original sequence. While a natural generalization of the perfect assembly formulation to these cases would be to consider a rate-distortion framework, partial assemblies are usually represented in terms of an assembly graph, making the definition of a distortion measure challenging. In this work, we introduce a distortion function for assembly graphs that can be understood as the logarithm of the number of Eulerian cycles in the assembly graph, each of which corresponds to a candidate assembly that could have generated the observed reads. We also introduce an algorithm for the construction of an assembly graph and analyze its performance on real genomes., Comment: To be published at ISIT-2016. 11 pages, 10 figures
- Published
- 2016
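The distortion measure above is the log of the number of Eulerian cycles. For a small directed multigraph this count can be computed exactly via the BEST theorem combined with the Matrix-Tree theorem; the sketch below is a generic illustration on a toy graph and may differ from the paper's exact normalization.

    import numpy as np
    from math import factorial, log

    def eulerian_cycle_count(adj):
        # BEST theorem: #Eulerian cycles = (#arborescences rooted at any node)
        #   * prod_v (outdeg(v) - 1)!, for a connected Eulerian digraph.
        # adj[i][j] = multiplicity of the directed edge i -> j.
        A = np.array(adj, dtype=float)
        outdeg = A.sum(axis=1)
        L = np.diag(outdeg) - A                      # directed Laplacian
        arbs = round(np.linalg.det(L[1:, 1:]))       # Matrix-Tree theorem
        count = arbs
        for d in outdeg:
            count *= factorial(int(d) - 1)
        return count

    # Toy "assembly graph": the bidirected triangle (in/out degree 2 everywhere).
    adj = [[0, 1, 1],
           [1, 0, 1],
           [1, 1, 0]]
    c = eulerian_cycle_count(adj)
    print("Eulerian cycles:", c, " distortion log(count):", log(c))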
42. Robust Estimators in High Dimensions without the Computational Intractability
- Author
-
Diakonikolas, Ilias, Kamath, Gautam, Kane, Daniel, Li, Jerry, Moitra, Ankur, and Stewart, Alistair
- Subjects
Computer Science - Data Structures and Algorithms ,Computer Science - Information Theory ,Computer Science - Machine Learning ,Mathematics - Statistics Theory ,Statistics - Machine Learning - Abstract
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an $\varepsilon$-fraction of the samples. Such questions have a rich history spanning statistics, machine learning, and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high dimensions that may be applicable to many other problems.
- Published
- 2016
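As a flavor of the algorithmic ideas in this line of work (a caricature, not the paper's actual algorithm): corrupted points tend to inflate the empirical covariance along some direction, so one can repeatedly examine its top eigenvector and discard the most extreme points along it. Thresholds and constants below are arbitrary illustrative choices.

    import numpy as np

    def filtered_mean(X, eps, iters=20):
        # Iteratively drop the eps-fraction of points with largest projection
        # onto the top eigenvector of the empirical covariance, until the
        # covariance has small spectral norm. Minimal sketch of a filter.
        X = np.array(X, dtype=float)
        for _ in range(iters):
            mu = X.mean(axis=0)
            w, V = np.linalg.eigh(np.cov(X, rowvar=False))
            if w[-1] <= 1 + 10 * eps:        # covariance already well-behaved
                break
            scores = ((X - mu) @ V[:, -1]) ** 2
            X = X[scores < np.quantile(scores, 1 - eps)]
        return X.mean(axis=0)

    rng = np.random.default_rng(0)
    d, n, eps = 20, 5000, 0.05
    inliers = rng.standard_normal((int(n * (1 - eps)), d))     # true mean 0
    outliers = 5.0 + rng.standard_normal((int(n * eps), d))    # planted shift
    X = np.vstack([inliers, outliers])
    print("naive error:   ", np.linalg.norm(X.mean(axis=0)))
    print("filtered error:", np.linalg.norm(filtered_mean(X, eps)))

The naive mean moves by roughly $\varepsilon$ times the outlier shift, which grows with dimension for worst-case corruptions; the filtered estimate does not.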
43. Community Recovery in Graphs with Locality
- Author
-
Chen, Yuxin, Kamath, Govinda, Suh, Changho, and Tse, David
- Subjects
Computer Science - Information Theory ,Computer Science - Learning ,Computer Science - Social and Information Networks ,Mathematics - Statistics Theory ,Quantitative Biology - Genomics - Abstract
Motivated by applications in domains such as social networks and computational biology, we study the problem of community recovery in graphs with locality. In this problem, pairwise noisy measurements of whether two nodes are in the same community or different communities come mainly or exclusively from nearby nodes, rather than being sampled uniformly across all node pairs as in most existing models. We present an algorithm that runs in time nearly linear in the number of measurements and achieves the information-theoretic limit for exact recovery., Comment: accepted in part to International Conference on Machine Learning (ICML), 2016
- Published
- 2016
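A toy sketch of the locality setting (emphatically not the paper's algorithm): nodes sit on a line, measurements are noisy same/different parities between nodes at distance at most $r$, a first pass chains adjacent measurements into a global guess, and local majority votes refine it. All parameters are illustrative.

    import random

    def recover(n, r, noisy, rounds=10):
        # noisy[(i, j)] = observed parity (0: same community, 1: different)
        # for pairs with i < j <= i + r.
        labels = [0] * n
        for i in range(1, n):                    # stage 1: chain adjacent links
            labels[i] = labels[i - 1] ^ noisy[(i - 1, i)]
        for _ in range(rounds):                  # stage 2: local majority votes
            for i in range(n):
                votes = 0
                for j in range(max(0, i - r), min(n, i + r + 1)):
                    if j == i:
                        continue
                    guess = labels[j] ^ noisy[(min(i, j), max(i, j))]
                    votes += 1 if guess else -1
                labels[i] = 1 if votes > 0 else 0
        return labels

    rng = random.Random(1)
    n, r, p = 200, 4, 0.1                        # p: measurement flip probability
    truth = [rng.randint(0, 1) for _ in range(n)]
    noisy = {(i, j): (truth[i] ^ truth[j]) ^ int(rng.random() < p)
             for i in range(n) for j in range(i + 1, min(n, i + r + 1))}
    labels = recover(n, r, noisy)
    ham = sum(a != b for a, b in zip(labels, truth))
    print("mislabeled (up to global flip):", min(ham, n - ham), "of", n)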
44. Optimal Testing for Properties of Distributions
- Author
-
Acharya, Jayadev, Daskalakis, Constantinos, and Kamath, Gautam
- Subjects
Computer Science - Data Structures and Algorithms ,Computer Science - Information Theory ,Computer Science - Learning ,Mathematics - Statistics Theory - Abstract
Given samples from an unknown distribution $p$, is it possible to distinguish between $p$ belonging to some class of distributions $\mathcal{C}$ and $p$ being far from every distribution in $\mathcal{C}$? This fundamental question has received tremendous attention in statistics, focusing primarily on asymptotic analysis, and more recently in information theory and theoretical computer science, where the emphasis has been on small sample size and computational complexity. Nevertheless, even for basic properties of distributions such as monotonicity, log-concavity, unimodality, independence, and monotone hazard rate, the optimal sample complexity is unknown. We provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. At the core of our approach is an algorithm which solves the following problem: Given samples from an unknown distribution $p$ and a known distribution $q$, are $p$ and $q$ close in $\chi^2$-distance, or far in total variation distance? The optimality of our testers is established by providing matching lower bounds with respect to both $n$ and $\varepsilon$. Finally, a necessary building block for our testers and an important byproduct of our work are the first known computationally efficient proper learners for discrete log-concave and monotone hazard rate distributions., Comment: 31 pages, extended abstract appeared as a spotlight in NIPS 2015
- Published
- 2015
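The core subroutine above admits a compact empirical illustration. Under Poissonized sampling, the statistic $Z = \sum_i ((N_i - nq_i)^2 - N_i)/(nq_i)$ has expectation $n \cdot \chi^2(p, q)$, so it concentrates near zero when $p = q$ and grows with $n$ when $p$ is far from $q$. The instance below is an illustrative choice, not the paper's.

    import numpy as np

    def chi_sq_statistic(samples, q):
        # Z = sum_i ((N_i - n q_i)^2 - N_i) / (n q_i), with N_i the empirical
        # counts; the -N_i correction removes the variance contribution.
        n = len(samples)
        N = np.bincount(samples, minlength=len(q))
        return np.sum((N - n * q) ** 2 / (n * q) - N / (n * q))

    rng = np.random.default_rng(0)
    k, n = 100, 5000
    q = np.full(k, 1.0 / k)                      # known reference distribution
    p_far = q.copy()
    p_far[:k // 2] *= 1.4                        # TV distance 0.2 from q
    p_far[k // 2:] *= 0.6
    print("p = q:  ", chi_sq_statistic(rng.choice(k, n, p=q), q))
    print("p far:  ", chi_sq_statistic(rng.choice(k, n, p=p_far), q))

A tester thresholds $Z$; the paper's contribution is showing this yields sample-optimal testers for the listed families once combined with proper learning.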
45. The two-unicast problem
- Author
-
Kamath, Sudeep, Anantharam, Venkat, Tse, David, and Wang, Chih-Chun
- Subjects
Computer Science - Information Theory - Abstract
We consider the communication capacity of wireline networks for a two-unicast traffic pattern. The network has two sources and two destinations with each source communicating a message to its own destination, subject to the capacity constraints on the directed edges of the network. We propose a simple outer bound for the problem that we call the Generalized Network Sharing (GNS) bound. We show this bound is the tightest edge-cut bound for two-unicast networks and is tight in several bottleneck cases, though it is not tight in general. We also show that the problem of computing the GNS bound is NP-complete. Finally, we show that despite its seeming simplicity, the two-unicast problem is as hard as the most general network coding problem. As a consequence, linear coding is insufficient to achieve capacity for general two-unicast networks, and non-Shannon inequalities are necessary for characterizing capacity of general two-unicast networks., Comment: 23 pages, 22 figures
- Published
- 2015
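Since computing the GNS bound is NP-complete, brute force is a legitimate illustration for tiny networks. The sketch below reflects one reading of a GNS cut for two-unicast: an edge set whose removal disconnects both source-destination pairs and at least one cross pair; consult the paper for the precise definition.

    from itertools import combinations

    def disconnected(nodes, edges, removed, s, t):
        # DFS over surviving edges; True iff no s -> t path remains.
        adj = {u: [] for u in nodes}
        for (u, v, c) in edges:
            if (u, v, c) not in removed:
                adj[u].append(v)
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            if u == t:
                return False
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return True

    def gns_bound(nodes, edges, s1, t1, s2, t2):
        best = float("inf")
        for k in range(len(edges) + 1):
            for removed in combinations(edges, k):
                rm = set(removed)
                if (disconnected(nodes, edges, rm, s1, t1)
                        and disconnected(nodes, edges, rm, s2, t2)
                        and (disconnected(nodes, edges, rm, s1, t2)
                             or disconnected(nodes, edges, rm, s2, t1))):
                    best = min(best, sum(c for (_, _, c) in removed))
        return best

    # Toy network, edges as (u, v, capacity); both sessions share edge (m, n).
    nodes = ["s1", "s2", "m", "n", "t1", "t2"]
    edges = [("s1", "m", 1), ("s2", "m", 1), ("m", "n", 1),
             ("n", "t1", 1), ("n", "t2", 1)]
    print("GNS sum-rate bound:", gns_bound(nodes, edges, "s1", "t1", "s2", "t2"))

Here the single shared edge is a GNS cut of capacity 1, so the sum rate is at most 1.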
46. Communication Complexity of Permutation-Invariant Functions
- Author
-
Ghazi, Badih, Kamath, Pritish, and Sudan, Madhu
- Subjects
Computer Science - Computational Complexity ,Computer Science - Information Theory - Abstract
Motivated by the quest for a broader understanding of communication complexity of simple functions, we introduce the class of "permutation-invariant" functions. A partial function $f:\{0,1\}^n \times \{0,1\}^n\to \{0,1,?\}$ is permutation-invariant if for every bijection $\pi:\{1,\ldots,n\} \to \{1,\ldots,n\}$ and every $\mathbf{x}, \mathbf{y} \in \{0,1\}^n$, it is the case that $f(\mathbf{x}, \mathbf{y}) = f(\mathbf{x}^{\pi}, \mathbf{y}^{\pi})$. Most of the commonly studied functions in communication complexity are permutation-invariant. For such functions, we present a simple complexity measure (computable in time polynomial in $n$ given an implicit description of $f$) that describes their communication complexity up to polynomial factors and up to an additive error that is logarithmic in the input size. This gives a coarse taxonomy of the communication complexity of simple functions. Our work highlights the role of the well-known lower bounds for functions such as 'Set-Disjointness' and 'Indexing', while complementing them with the lesser-known upper bounds for 'Gap-Inner-Product' (from the sketching literature) and 'Sparse-Gap-Inner-Product' (from the recent work of Canonne et al. [ITCS 2015]). We also present consequences for the study of communication complexity with imperfectly shared randomness, where we show that for total permutation-invariant functions, imperfectly shared randomness results in only a polynomial blow-up in communication complexity after an additive $O(\log \log n)$ overhead.
- Published
- 2015
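A quick sanity-check utility for the definition above: empirically test whether a function is permutation-invariant by comparing it on random inputs before and after a random relabeling of coordinates. Set-Disjointness (invariant) and an Indexing-like function (not invariant) serve as illustrative probes.

    import random

    def is_permutation_invariant(f, n, trials=2000, seed=0):
        # Randomized check of f(x, y) == f(pi(x), pi(y)) for random inputs
        # x, y and random coordinate bijections pi.
        rng = random.Random(seed)
        for _ in range(trials):
            x = [rng.randint(0, 1) for _ in range(n)]
            y = [rng.randint(0, 1) for _ in range(n)]
            pi = list(range(n))
            rng.shuffle(pi)
            xp = [x[pi[i]] for i in range(n)]
            yp = [y[pi[i]] for i in range(n)]
            if f(x, y) != f(xp, yp):
                return False
        return True

    disj = lambda x, y: int(all(not (a and b) for a, b in zip(x, y)))
    index_like = lambda x, y: x[0] ^ y[0]    # depends on coordinate 0
    print(is_permutation_invariant(disj, 8))        # True
    print(is_permutation_invariant(index_like, 8))  # False (w.h.p.)

A total permutation-invariant function on $\{0,1\}^n \times \{0,1\}^n$ is determined by the triple $(|x|, |y|, |x \wedge y|)$, which is what makes a compact complexity measure plausible.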
47. On Non-Interactive Simulation of Joint Distributions
- Author
-
Kamath, Sudeep and Anantharam, Venkat
- Subjects
Computer Science - Information Theory - Abstract
We consider the following non-interactive simulation problem: Alice and Bob observe sequences $X^n$ and $Y^n$ respectively, where $\{(X_i, Y_i)\}_{i=1}^n$ are drawn i.i.d. from $P(x,y)$, and they output $U$ and $V$ respectively, whose joint law is required to be close in total variation to a specified $Q(u,v)$. It is known that the maximal correlation of $U$ and $V$ can be no larger than that of $X$ and $Y$ if this is to be possible. Our main contribution is to bring hypercontractivity to bear as a tool on this problem. In particular, we show that if $P(x,y)$ is the doubly symmetric binary source, then hypercontractivity provides stronger impossibility results than maximal correlation. Finally, we extend these tools to provide impossibility results for the $k$-agent version of this problem., Comment: 25 pages, 13 figures
- Published
- 2015
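The maximal-correlation necessary condition mentioned above is easy to check numerically: $\rho_m(X;Y)$ equals the second singular value of the matrix $P(x,y)/\sqrt{P(x)P(y)}$ (the largest is always 1). The instance below, the uniform-on-three-pairs source against a DSBS with correlation $0.49$ (echoing entry 40), is illustrative; the entry's point is that hypercontractivity can rule out simulations that this weaker test does not.

    import numpy as np

    def maximal_correlation(joint):
        # Hirschfeld-Gebelein-Renyi maximal correlation via SVD of
        # B[x, y] = P(x, y) / sqrt(P(x) P(y)).
        P = np.asarray(joint, dtype=float)
        px, py = P.sum(axis=1), P.sum(axis=0)
        B = P / np.sqrt(np.outer(px, py))
        return np.linalg.svd(B, compute_uv=False)[1]

    # Necessary condition for non-interactive simulation: maxcorr(Q) <= maxcorr(P).
    P = np.array([[1, 1], [1, 0]]) / 3.0         # uniform on {(0,0),(0,1),(1,0)}
    rho = 0.49
    Q = np.array([[(1 + rho) / 4, (1 - rho) / 4],
                  [(1 - rho) / 4, (1 + rho) / 4]])   # DSBS with correlation 0.49
    print("maxcorr(P) =", maximal_correlation(P))    # 0.5
    print("maxcorr(Q) =", maximal_correlation(Q))    # 0.49: the test is passed

Since $0.49 \le 0.5$, maximal correlation alone cannot rule out this simulation, which is consistent with the question being open.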
48. Optimal Haplotype Assembly from High-Throughput Mate-Pair Reads
- Author
-
Kamath, Govinda M., Şaşoğlu, Eren, and Tse, David
- Subjects
Computer Science - Information Theory ,Computer Science - Computational Engineering, Finance, and Science ,Quantitative Biology - Genomics ,Statistics - Applications - Abstract
Humans have $23$ pairs of homologous chromosomes. The two chromosomes in each homologous pair are nearly identical; for the most part, differences between homologous chromosomes occur at certain documented positions called single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. In this paper, we study the problem of inferring haplotypes of individuals from mate-pair reads of their genomes. We give a simple formula for the coverage needed for haplotype assembly under a generative model. The analysis leverages connections between this problem and the decoding of convolutional codes., Comment: 10 pages, 4 figures, Submitted to ISIT 2015
- Published
- 2015
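A toy caricature of the assembly task (the paper's model uses mate-pairs linking distant SNPs and a convolutional-code/Viterbi viewpoint; this sketch only chains adjacent SNPs): each read reports, possibly erroneously, whether two SNPs lie on the same chromosome of the pair, and majority voting per link followed by chaining recovers the phase up to a global flip. Coverage and error rate are illustrative.

    import random
    from collections import defaultdict

    def assemble_phase(n, reads):
        # reads: (i, j, parity) with parity 0 if SNPs i and j are in the same
        # phase, 1 otherwise; majority-vote each adjacent link, then chain.
        votes = defaultdict(list)
        for i, j, parity in reads:
            votes[(min(i, j), max(i, j))].append(parity)
        phase = [0] * n
        for i in range(1, n):
            v = votes.get((i - 1, i), [0])
            link = 1 if 2 * sum(v) > len(v) else 0
            phase[i] = phase[i - 1] ^ link
        return phase

    rng = random.Random(0)
    n, cov, p_err = 100, 10, 0.1        # SNPs, reads per link, parity error rate
    truth = [rng.randint(0, 1) for _ in range(n)]
    reads = [(i, i + 1, (truth[i] ^ truth[i + 1]) ^ int(rng.random() < p_err))
             for i in range(n - 1) for _ in range(cov)]
    phase = assemble_phase(n, reads)
    ham = sum(a != b for a, b in zip(phase, truth))
    print("mis-phased SNPs (up to global flip):", min(ham, n - ham), "of", n)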
49. On Maximal Correlation, Hypercontractivity, and the Data Processing Inequality studied by Erkip and Cover
- Author
-
Anantharam, Venkat, Gohari, Amin, Kamath, Sudeep, and Nair, Chandra
- Subjects
Computer Science - Information Theory - Abstract
In this paper we provide a new geometric characterization of the Hirschfeld-Gebelein-R\'{e}nyi maximal correlation of a pair of random variables $(X,Y)$, as well as of the chordal slope of the nontrivial boundary of the hypercontractivity ribbon of $(X,Y)$ at infinity. The new characterizations lead to simple proofs for some of the known facts about these quantities. We also provide a counterexample to a data processing inequality claimed by Erkip and Cover, and find the correct tight constant for this kind of inequality., Comment: 11 pages
- Published
- 2013
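The claimed data processing inequality referenced above has the form $I(U;Y) \le \rho_m^2(X;Y)\, I(U;X)$ for Markov chains $U - X - Y$; the paper shows this fails in general and identifies the correct tight constant. The sketch below merely computes the ratio $I(U;Y)/I(U;X)$ for one symmetric toy chain (which happens to satisfy the claimed bound); all distributions are illustrative.

    import numpy as np

    def mutual_information(pxy):
        # I(A;B) in bits from a joint probability matrix.
        pxy = np.asarray(pxy, dtype=float)
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        mask = pxy > 0
        return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

    rho = 0.6
    Pxy = np.array([[(1 + rho) / 4, (1 - rho) / 4],
                    [(1 - rho) / 4, (1 + rho) / 4]])   # DSBS(rho): rho_m = rho
    a = 0.2                                            # BSC(a) from X to U
    Pu_given_x = np.array([[1 - a, a], [a, 1 - a]])    # rows x, columns u
    Pux = Pu_given_x.T * Pxy.sum(axis=1)               # P(u, x) = P(u|x) P(x)
    Puy = Pu_given_x.T @ Pxy                           # P(u, y) = sum_x P(u|x) P(x, y)
    print("I(U;Y)/I(U;X) =", mutual_information(Puy) / mutual_information(Pux),
          " rho_m^2 =", rho ** 2)

For this symmetric chain the ratio comes out near $0.34$, below $\rho_m^2 = 0.36$; the paper's counterexample is necessarily less symmetric.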
50. Explicit MBR All-Symbol Locality Codes
- Author
-
Kamath, Govinda M., Silberstein, Natalia, Prakash, N., Rawat, Ankit S., Lalitha, V., Koyluoglu, O. Ozan, Kumar, P. Vijay, and Vishwanath, Sriram
- Subjects
Computer Science - Information Theory - Abstract
Node failures are inevitable in distributed storage systems (DSS). To enable efficient repair when faced with such failures, two main techniques are known: regenerating codes, which minimize the total repair bandwidth, and codes with locality, which minimize the number of nodes participating in the repair process. This paper focuses on regenerating codes with locality, using pre-coding based on Gabidulin codes, and presents constructions that utilize minimum bandwidth regenerating (MBR) local codes. The constructions achieve maximum resilience (i.e., optimal minimum distance) and maximum capacity (i.e., maximum rate). Finally, the same pre-coding mechanism can be combined with a subclass of fractional-repetition codes to enable maximum resilience and repair-by-transfer simultaneously.
- Published
- 2013
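For orientation on the MBR point used above: at minimum bandwidth regeneration each node stores $\alpha = d\beta$ symbols, a replacement node downloads $\beta$ symbols from each of $d$ helpers, so repair bandwidth equals storage (which is what makes repair-by-transfer possible), and the cut-set bound gives file size $M = \sum_{i=0}^{k-1} (d-i)\beta$. A small parameter calculation, with $(k, d, \beta)$ chosen arbitrarily for illustration:

    def mbr_parameters(k, d, beta=1):
        # Cut-set bound evaluated at the MBR point: alpha = d * beta and
        # M = sum_{i=0}^{k-1} (d - i) * beta = (k*d - k*(k-1)/2) * beta.
        alpha = d * beta                         # storage per node
        M = sum((d - i) * beta for i in range(k))
        return alpha, M

    k, d, beta = 3, 5, 1
    alpha, M = mbr_parameters(k, d, beta)
    print(f"MBR point: alpha={alpha} per node, repair bandwidth d*beta={d * beta},"
          f" file size M={M}")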