Author: "Xu, JiaMing" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xu, JiaMing"' showing total 77 results

Start Over Author "Xu, JiaMing" Database arXiv

77 results on '"Xu, JiaMing"'

1. Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Author: Li, Jinhao, Xu, Jiaming, Huang, Shan, Chen, Yonghua, Li, Wen, Liu, Jun, Lian, Yaoxiu, Pan, Jiayi, Ding, Li, Zhou, Hao, Wang, Yu, and Dai, Guohao
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms, which can point to the future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios., Comment: 43 pages, 15 figures
Published: 2024

2. MARCA: Mamba Accelerator with ReConfigurable Architecture

Author: Li, Jinhao, Huang, Shan, Xu, Jiaming, Liu, Jun, Ding, Li, Xu, Ningyi, and Dai, Guohao
Subjects: Computer Science - Hardware Architecture, Computer Science - Artificial Intelligence
Abstract: We propose a Mamba accelerator with reconfigurable architecture, MARCA.We propose three novel approaches in this paper. (1) Reduction alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses. (2) Reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation by a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations by a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss.(3) Intra-operation and inter-operation buffer management strategy. We propose intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families with different sizes.MARCA achieves up to 463.22$\times$/11.66$\times$ speedup and up to 9761.42$\times$/242.52$\times$ energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively., Comment: 9 pages, 10 figures, accepted by ICCAD 2024. arXiv admin note: text overlap with arXiv:2001.02514 by other authors
Published: 2024

3. Collaborative Learning with Shared Linear Representations: Statistical Rates and Optimal Algorithms

Author: Niu, Xiaochun, Su, Lili, Xu, Jiaming, and Yang, Pengkun
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Collaborative learning enables multiple clients to learn shared feature representations across local data distributions, with the goal of improving model performance and reducing overall sample complexity. While empirical evidence shows the success of collaborative learning, a theoretical understanding of the optimal statistical rate remains lacking, even in linear settings. In this paper, we identify the optimal statistical rate when clients share a common low-dimensional linear representation. Specifically, we design a spectral estimator with local averaging that approximates the optimal solution to the least squares problem. We establish a minimax lower bound to demonstrate that our estimator achieves the optimal error rate. Notably, the optimal rate reveals two distinct phases. In typical cases, our rate matches the standard rate based on the parameter counting of the linear representation. However, a statistical penalty arises in collaborative learning when there are too many clients or when local datasets are relatively small. Furthermore, our results, unlike existing ones, show that, at a system level, collaboration always reduces overall sample complexity compared to independent client learning. In addition, at an individual level, we provide a more precise characterization of when collaboration benefits a client in transfer learning and private fine-tuning.
Published: 2024

4. A Survey on Efficient Inference for Large Language Models

Author: Zhou, Zixuan, Ning, Xuefei, Hong, Ke, Fu, Tianyu, Xu, Jiaming, Li, Shiyao, Lou, Yuming, Wang, Luning, Yuan, Zhihang, Li, Xiuhong, Yan, Shengen, Dai, Guohao, Zhang, Xiao-Ping, Dong, Yuhan, and Wang, Yu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.
Published: 2024

5. Sharp Information-Theoretic Thresholds for Shuffled Linear Regression

Author: Lufkin, Leon, Wu, Yihong, and Xu, Jiaming
Subjects: Mathematics - Statistics Theory
Abstract: This paper studies the problem of shuffled linear regression, where the correspondence between predictors and responses in a linear model is obfuscated by a latent permutation. Specifically, we consider the model $y = \Pi_* X \beta_* + w$, where $X$ is an $n \times d$ standard Gaussian design matrix, $w$ is Gaussian noise with entrywise variance $\sigma^2$, $\Pi_*$ is an unknown $n \times n$ permutation matrix, and $\beta_*$ is the regression coefficient, also unknown. Previous work has shown that, in the large $n$-limit, the minimal signal-to-noise ratio ($\mathsf{SNR}$), $\lVert \beta_* \rVert^2/\sigma^2$, for recovering the unknown permutation exactly with high probability is between $n^2$ and $n^C$ for some absolute constant $C$ and the sharp threshold is unknown even for $d=1$. We show that this threshold is precisely $\mathsf{SNR} = n^4$ for exact recovery throughout the sublinear regime $d=o(n)$. As a by-product of our analysis, we also determine the sharp threshold of almost exact recovery to be $\mathsf{SNR} = n^2$, where all but a vanishing fraction of the permutation is reconstructed., Comment: 18 pages (9 main, 1 references, 8 appendix)
Published: 2024

6. Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Author: Li, Jinhao, Xu, Jiaming, Li, Shiyao, Huang, Shan, Liu, Jun, Lian, Yaoxiu, and Dai, Guohao
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory consumption. Applying 2-bit single-precision weight quantization brings >3% accuracy loss, so the state-of-the-art methods use mixed-precision methods for LLMs (e.g. Llama2-7b, etc.) to improve the accuracy. However, challenges still exist: (1) Uneven distribution in weight matrix. (2) Large speed degradation by adding sparse outliers. (3) Time-consuming dequantization operations on GPUs. To tackle these challenges and enable fast and efficient LLM inference on GPUs, we propose the following techniques in this paper. (1) Intra-weight mixed-precision quantization. (2) Exclusive 2-bit sparse outlier with minimum speed degradation. (3) Asynchronous dequantization. We conduct extensive experiments on different model families (e.g. Llama3, etc.) and model sizes. We achieve 2.91-bit for each weight considering all scales/zeros for different models with negligible loss. As a result, with our 2/4/16 mixed-precision quantization for each weight matrix and asynchronous dequantization during inference, our design achieves an end-to-end speedup for Llama2-7b is 1.74x over the original model, and we reduce both runtime cost and total cost by up to 2.53x and 2.29x with less GPU requirements.
Published: 2023

7. FlashDecoding++: Faster Large Language Model Inference on GPUs

Author: Hong, Ke, Dai, Guohao, Xu, Jiaming, Mao, Qiuli, Li, Xiuhong, Liu, Jun, Chen, Kangdi, Dong, Yuhan, and Wang, Yu
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: As the Large Language Model (LLM) becomes increasingly important in various domains. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ~20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow using different hardware resource considering input dynamics. Due to the versatility of optimizations in FlashDecoding++, FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
Published: 2023

8. Federated Learning in the Presence of Adversarial Client Unavailability

Author: Su, Lili, Xiang, Ming, Xu, Jiaming, and Yang, Pengkun
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Federated learning is a decentralized machine learning framework that enables collaborative model training without revealing raw data. Due to the diverse hardware and software limitations, a client may not always be available for the computation requests from the parameter server. An emerging line of research is devoted to tackling arbitrary client unavailability. However, existing work still imposes structural assumptions on the unavailability patterns, impeding their applicability in challenging scenarios wherein the unavailability patterns are beyond the control of the parameter server. Moreover, in harsh environments like battlefields, adversaries can selectively and adaptively silence specific clients. In this paper, we relax the structural assumptions and consider adversarial client unavailability. To quantify the degrees of client unavailability, we use the notion of $\epsilon$-adversary dropout fraction. We show that simple variants of FedAvg or FedProx, albeit completely agnostic to $\epsilon$, converge to an estimation error on the order of $\epsilon (G^2 + \sigma^2)$ for non-convex global objectives and $\epsilon(G^2 + \sigma^2)/\mu^2$ for $\mu$ strongly convex global objectives, where $G$ is a heterogeneity parameter and $\sigma^2$ is the noise level. Conversely, we prove that any algorithm has to suffer an estimation error of at least $\epsilon (G^2 + \sigma^2)/8$ and $\epsilon(G^2 + \sigma^2)/(8\mu^2)$ for non-convex global objectives and $\mu$-strongly convex global objectives. Furthermore, the convergence speeds of the FedAvg or FedProx variants are $O(1/\sqrt{T})$ for non-convex objectives and $O(1/T)$ for strongly-convex objectives, both of which are the best possible for any first-order method that only has access to noisy gradients.
Published: 2023

9. Rectangular matrix additions in low and high temperatures

Author: Xu, Jiaming
Subjects: Mathematics - Probability, Mathematical Physics, Mathematics - Combinatorics, Mathematics - Operator Algebras
Abstract: We study the addition of two random independent $M\times N$ rectangular random matrices with invariant distributions in two limit regimes, where the parameter beta (inverse temperature) goes to infinity and zero. In low temperature regime the random singular values of the sum concentrate at deterministic points, while in high temperature regime we obtain a Law of Large Numbers for the empirical measures. As a consequence, we deliver a duality between low and high temperatures. Our proof uses the type BC Bessel function as characteristic function of rectangular matrices, and through the analysis of this function we introduce a new family of cumulants, that linearize the addition in high temperature limit, and degenerate to the classical or free cumulants in special cases., Comment: 68 pages, 2 figures, more references are added
Published: 2023

10. Random graph matching at Otter's threshold via counting chandeliers

Author: Mao, Cheng, Wu, Yihong, Xu, Jiaming, and Yu, Sophie H.
Subjects: Computer Science - Data Structures and Algorithms, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We propose an efficient algorithm for graph matching based on similarity scores constructed from counting a certain family of weighted trees rooted at each vertex. For two Erd\H{o}s-R\'enyi graphs $\mathcal{G}(n,q)$ whose edges are correlated through a latent vertex correspondence, we show that this algorithm correctly matches all but a vanishing fraction of the vertices with high probability, provided that $nq\to\infty$ and the edge correlation coefficient $\rho$ satisfies $\rho^2>\alpha \approx 0.338$, where $\alpha$ is Otter's tree-counting constant. Moreover, this almost exact matching can be made exact under an extra condition that is information-theoretically necessary. This is the first polynomial-time graph matching algorithm that succeeds at an explicit constant correlation and applies to both sparse and dense graphs. In comparison, previous methods either require $\rho=1-o(1)$ or are restricted to sparse graphs. The crux of the algorithm is a carefully curated family of rooted trees called chandeliers, which allows effective extraction of the graph correlation from the counts of the same tree while suppressing the undesirable correlation between those of different trees.
Published: 2022

11. Random sorting networks: edge limit

Author: Gorin, Vadim and Xu, Jiaming
Subjects: Mathematics - Probability, Mathematics - Combinatorics
Abstract: A sorting network is a shortest path from $12\dots n$ to $n\dots 21$ in the Cayley graph of the symmetric group $\mathfrak S_n$ spanned by adjacent transpositions. The paper computes the edge local limit of the uniformly random sorting networks as $n\to\infty$. We find the asymptotic distribution of the first occurrence of a given swap $(k,k+1)$ and identify it with the law of the smallest positive eigenvalue of a $2k\times 2k$ aGUE (an aGUE matrix has purely imaginary Gaussian entries that are independently distributed subject to skew-symmetry). Next, we give two different formal definitions of a spacing -- the time distance between the occurrence of a given swap $(k,k+1)$ in a uniformly random sorting network. Two definitions lead to two different expressions for the asymptotic laws expressed in terms of derivatives of Fredholm determinants., Comment: 23 pages, 6 figues. Final version, to appear in Annales de l'Institut Henri Poincare: Probability and Statistics
Published: 2022

12. Global Convergence of Federated Learning for Mixed Regression

Author: Su, Lili, Xu, Jiaming, and Yang, Pengkun
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: This paper studies the problem of model training under Federated Learning when clients exhibit cluster structure. We contextualize this problem in mixed regression, where each client has limited local data generated from one of $k$ unknown regression models. We design an algorithm that achieves global convergence from any initialization, and works even when local data volume is highly unbalanced -- there could exist clients that contain $O(1)$ data points only. Our algorithm first runs moment descent on a few anchor clients (each with $\tilde{\Omega}(k)$ data points) to obtain coarse model estimates. Then each client alternately estimates its cluster labels and refines the model estimates based on FedAvg or FedProx. A key innovation in our analysis is a uniform estimate on the clustering errors, which we prove by bounding the VC dimension of general polynomial concept classes based on the theory of algebraic geometry.
Published: 2022

13. SeedGNN: Graph Neural Networks for Supervised Seeded Graph Matching

Author: Yu, Liren, Xu, Jiaming, and Lin, Xiaojun
Subjects: Computer Science - Machine Learning
Abstract: There is a growing interest in designing Graph Neural Networks (GNNs) for seeded graph matching, which aims to match two unlabeled graphs using only topological information and a small set of seed nodes. However, most previous GNNs for this task use a semi-supervised approach, which requires a large number of seeds and cannot learn knowledge that is transferable to unseen graphs. In contrast, this paper proposes a new supervised approach that can learn from a training set how to match unseen graphs with only a few seeds. Our SeedGNN architecture incorporates several novel designs, inspired by theoretical studies of seeded graph matching: 1) it can learn to compute and use witness-like information from different hops, in a way that can be generalized to graphs of different sizes; 2) it can use easily-matched node-pairs as new seeds to improve the matching in subsequent layers. We evaluate SeedGNN on synthetic and real-world graphs and demonstrate significant performance improvements over both non-learning and learning algorithms in the existing literature. Furthermore, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs of different sizes and categories.
Published: 2022

14. Random Graph Matching in Geometric Models: the Case of Complete Graphs

Author: Wang, Haoyu, Wu, Yihong, Xu, Jiaming, and Yolou, Israel
Subjects: Mathematics - Statistics Theory, Mathematics - Probability, Statistics - Machine Learning
Abstract: This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of [DCK19] and [KNW22] in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of [Ume88] emerges as a further approximation to the maximum likelihood in the geometric model., Comment: Corrected some typos
Published: 2022

15. LiMuSE: Lightweight Multi-modal Speaker Extraction

Author: Liu, Qinghua, Huang, Yating, Hao, Yunzhe, Xu, Jiaming, and Xu, Bo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Multi-modal cues, including spatial information, facial expression and voiceprint, are introduced to the speech separation and speaker extraction tasks to serve as complementary information to achieve better performance. However, the introduction of these cues brings about an increasing number of parameters and model complexity, which makes it harder to deploy these models on resource-constrained devices. In this paper, we alleviate the aforementioned problem by proposing a Lightweight Multi-modal framework for Speaker Extraction (LiMuSE). We propose to use GC-equipped TCN, which incorporates Group Communication (GC) and Temporal Convolutional Network (TCN) in the Context Codec module, the audio block and the fusion block. The experiments on the MC_GRID dataset demonstrate that LiMuSE achieves on par or better performance with a much smaller number of parameters and less model complexity. We further investigate the impacts of the quantization of LiMuSE. Our code and dataset are provided., Comment: Accepted to IEEE SLT 2022
Published: 2021

16. Testing network correlation efficiently via counting trees

Author: Mao, Cheng, Wu, Yihong, Xu, Jiaming, and Yu, Sophie H.
Subjects: Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We propose a new procedure for testing whether two networks are edge-correlated through some latent vertex correspondence. The test statistic is based on counting the co-occurrences of signed trees for a family of non-isomorphic trees. When the two networks are Erd\H{o}s-R\'enyi random graphs $\mathcal{G}(n,q)$ that are either independent or correlated with correlation coefficient $\rho$, our test runs in $n^{2+o(1)}$ time and succeeds with high probability as $n\to\infty$, provided that $n\min\{q,1-q\} \ge n^{-o(1)}$ and $\rho^2>\alpha \approx 0.338$, where $\alpha$ is Otter's constant so that the number of unlabeled trees with $K$ edges grows as $(1/\alpha)^K$. This significantly improves the prior work in terms of statistical accuracy, running time, and graph sparsity.
Published: 2021

17. A Non-parametric View of FedAvg and FedProx: Beyond Stationary Points

Author: Su, Lili, Xu, Jiaming, and Yang, Pengkun
Subjects: Statistics - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: Federated Learning (FL) is a promising decentralized learning framework and has great potentials in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx - the two widely-adopted FL algorithms - fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, it is concerned that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of 1/t in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion-federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Numerical experiments further corroborate our theoretical findings.
Published: 2021

18. WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Author: Hao, Yunzhe, Xu, Jiaming, Zhang, Peng, and Xu, Bo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: In the speaker extraction problem, it is found that additional information from the target speaker contributes to the tracking and extraction of the target speaker, which includes voiceprint, lip movement, facial expression, and spatial information. However, no one cares for the cue of sound onset, which has been emphasized in the auditory scene analysis and psychology. Inspired by it, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task. We further extended to the onset/offset cues and got performance improvement. From the perspective of tasks, our onset/offset-based model completes the composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection. We also combined voiceprint with onset/offset cues. Voiceprint models voice characteristics of the target while onset/offset models the start/end information of the speech. From the perspective of auditory scene analysis, the combination of two perception cues can promote the integrity of the auditory object. The experiment results are also close to state-of-the-art performance, using nearly half of the parameters. We hope that this work will inspire communities of speech processing and psychology, and contribute to communication between them. Our code will be available in https://github.com/aispeech-lab/wase/., Comment: Accepted by ICASSP 2021
Published: 2021

19. One-pass Stochastic Gradient Descent in Overparametrized Two-layer Neural Networks

Author: Xu, Jiaming and Zhu, Hanjing
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous works assume that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparamterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid's inequality.
Published: 2021

20. MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation

Author: Li, Xiyun, Xu, Yong, Yu, Meng, Zhang, Shi-Xiong, Xu, Jiaming, Xu, Bo, and Yu, Dong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, our proposed recurrent neural network (RNN) based all deep learning minimum variance distortionless response (ADL-MVDR) beamformer method yielded superior performance over the conventional MVDR by replacing the matrix inversion and eigenvalue decomposition with two recurrent neural networks. In this work, we present a self-attentive RNN beamformer to further improve our previous RNN-based beamformer by leveraging on the powerful modeling capability of self-attention. Temporal-spatial self-attention module is proposed to better learn the beamforming weights from the speech and noise spatial covariance matrices. The temporal self-attention module could help RNN to learn global statistics of covariance matrices. The spatial self-attention module is designed to attend on the cross-channel correlation in the covariance matrices. Furthermore, a multi-channel input with multi-speaker directional features and multi-speaker speech separation outputs (MIMO) model is developed to improve the inference efficiency. The evaluations demonstrate that our proposed MIMO self-attentive RNN beamformer improves both the automatic speech recognition (ASR) accuracy and the perceptual estimation of speech quality (PESQ) against prior arts.
Published: 2021

21. The planted matching problem: Sharp threshold and infinite-order phase transition

Author: Ding, Jian, Wu, Yihong, Xu, Jiaming, and Yang, Dana
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory, Mathematics - Combinatorics, Mathematics - Probability, Statistics - Machine Learning
Abstract: We study the problem of reconstructing a perfect matching $M^*$ hidden in a randomly weighted $n\times n$ bipartite graph. The edge set includes every node pair in $M^*$ and each of the $n(n-1)$ node pairs not in $M^*$ independently with probability $d/n$. The weight of each edge $e$ is independently drawn from the distribution $\mathcal{P}$ if $e \in M^*$ and from $\mathcal{Q}$ if $e \notin M^*$. We show that if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \le 1$, where $B(\mathcal{P},\mathcal{Q})$ stands for the Bhattacharyya coefficient, the reconstruction error (average fraction of misclassified edges) of the maximum likelihood estimator of $M^*$ converges to $0$ as $n\to \infty$. Conversely, if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \ge 1+\epsilon$ for an arbitrarily small constant $\epsilon>0$, the reconstruction error for any estimator is shown to be bounded away from $0$ under both the sparse and dense model, resolving the conjecture in [Moharrami et al. 2019, Semerjian et al. 2020]. Furthermore, in the special case of complete exponentially weighted graph with $d=n$, $\mathcal{P}=\exp(\lambda)$, and $\mathcal{Q}=\exp(1/n)$, for which the sharp threshold simplifies to $\lambda=4$, we prove that when $\lambda \le 4-\epsilon$, the optimal reconstruction error is $\exp\left( - \Theta(1/\sqrt{\epsilon}) \right)$, confirming the conjectured infinite-order phase transition in [Semerjian et al. 2020].
Published: 2021

22. The Power of $D$-hops in Matching Power-Law Graphs

Author: Yu, Liren, Xu, Jiaming, and Lin, Xiaojun
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: This paper studies seeded graph matching for power-law graphs. Assume that two edge-correlated graphs are independently edge-sampled from a common parent graph with a power-law degree distribution. A set of correctly matched vertex-pairs is chosen at random and revealed as initial seeds. Our goal is to use the seeds to recover the remaining latent vertex correspondence between the two graphs. Departing from the existing approaches that focus on the use of high-degree seeds in $1$-hop neighborhoods, we develop an efficient algorithm that exploits the low-degree seeds in suitably-defined $D$-hop neighborhoods. Specifically, we first match a set of vertex-pairs with appropriate degrees (which we refer to as the first slice) based on the number of low-degree seeds in their $D$-hop neighborhoods. This significantly reduces the number of initial seeds needed to trigger a cascading process to match the rest of the graphs. Under the Chung-Lu random graph model with $n$ vertices, max degree $\Theta(\sqrt{n})$, and the power-law exponent $2<\beta<3$, we show that as soon as $D> \frac{4-\beta}{3-\beta}$, by optimally choosing the first slice, with high probability our algorithm can correctly match a constant fraction of the true pairs without any error, provided with only $\Omega((\log n)^{4-\beta})$ initial seeds. Our result achieves an exponential reduction in the seed size requirement, as the best previously known result requires $n^{1/2+\epsilon}$ seeds (for any small constant $\epsilon>0$). Performance evaluation with synthetic and real data further corroborates the improved performance of our algorithm.
Published: 2021

23. Learner-Private Convex Optimization

Author: Xu, Jiaming, Xu, Kuang, and Yang, Dana
Subjects: Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Mathematics - Optimization and Control, Mathematics - Statistics Theory
Abstract: Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. It has gained considerable popularity thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this paper, we study how to optimally obfuscate the learner's queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for an eavesdropping adversary. We consider two formulations of learner privacy: a Bayesian formulation in which the convex function is drawn randomly, and a minimax formulation in which the function is fixed and the adversary's probability of error is measured with respect to a minimax criterion. Suppose that the learner wishes to ensure the adversary cannot estimate accurately with probability greater than $1/L$ for some $L>0$. Our main results show that the query complexity overhead is additive in $L$ in the minimax formulation, but multiplicative in $L$ in the Bayesian formulation. Compared to existing learner-private sequential learning models with binary feedback, our results apply to the significantly richer family of general convex functions with full-gradient feedback. Our proofs learn on tools from the theory of Dirichlet processes, as well as a novel strategy designed for measuring information leakage under a full-gradient oracle.
Published: 2021

24. Speaker and Direction Inferred Dual-channel Speech Separation

Author: Li, Chenxing, Xu, Jiaming, Mesgarani, Nima, and Xu, Bo
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most speech separation methods, trying to separate all channel sources simultaneously, are still far from having enough general- ization capabilities for real scenarios where the number of input sounds is usually uncertain and even dynamic. In this work, we employ ideas from auditory attention with two ears and propose a speaker and direction inferred speech separation network (dubbed SDNet) to solve the cocktail party problem. Specifically, our SDNet first parses out the respective perceptual representations with their speaker and direction characteristics from the mixture of the scene in a sequential manner. Then, the perceptual representations are utilized to attend to each corresponding speech. Our model gener- ates more precise perceptual representations with the help of spatial features and successfully deals with the problem of the unknown number of sources and the selection of outputs. The experiments on standard fully-overlapped speech separation benchmarks, WSJ0- 2mix, WSJ0-3mix, and WSJ0-2&3mix, show the effectiveness, and our method achieves SDR improvements of 25.31 dB, 17.26 dB, and 21.56 dB under anechoic settings. Our codes will be released at https://github.com/aispeech-lab/SDNet., Comment: Accepted by ICASSP 2021
Published: 2021

25. Settling the Sharp Reconstruction Thresholds of Random Graph Matching

Author: Wu, Yihong, Xu, Jiaming, and Yu, Sophie H.
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory, Statistics - Machine Learning, 94A15, 62B10, 68Q87, 05C80, 05C60
Abstract: This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erd\H{o}s-R\'enyi model where the two graphs are subsampled from a common parent Erd\H{o}s-R\'enyi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that there exists a sharp threshold, above which one can correctly match all but a vanishing fraction of vertices and below which correctly matching any positive fraction is impossible, a phenomenon known as the "all-or-nothing" phase transition. Even more strikingly, in the Gaussian setting, above the threshold all vertices can be exactly matched with high probability. In contrast, for sparse Erd\H{o}s-R\'enyi graphs with $p=n^{-\Theta(1)}$, we show that the all-or-nothing phenomenon no longer holds and we determine the thresholds up to a constant factor. Along the way, we also derive the sharp threshold for exact recovery, sharpening the existing results in Erd\H{o}s-R\'enyi graphs. The proof of the negative results builds upon a tight characterization of the mutual information based on the truncated second-moment computation and an "area theorem" that relates the mutual information to the integral of the reconstruction error. The positive results follows from a tight analysis of the maximum likelihood estimator that takes into account the cycle structure of the induced permutation on the edges.
Published: 2021

26. Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

Author: Zhang, Peng, Xu, Jiaming, shi, Jing, Hao, Yunzhe, and Xu, Bo
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.m
Abstract: Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers. Although audio-only approaches achieve satisfactory performance, they build on a strategy to handle the predefined conditions, limiting their application in the complex auditory scene. Towards the cocktail party problem, we propose a novel audio-visual speech separation model. In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem. To improve our model's generalization ability to unknown speakers, we extract speech-related visual features from visual inputs explicitly by the adversarially disentangled method, and use this feature to assist speech separation. Besides, the time-domain approach is adopted, which could avoid the phase reconstruction problem existing in the time-frequency domain models. To compare our model's performance with other models, we create two benchmark datasets of 2-speaker mixture from GRID and TCDTIMIT audio-visual datasets. Through a series of experiments, our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models., Comment: 5 pages, 3 figures
Published: 2020

27. Testing correlation of unlabeled random graphs

Author: Wu, Yihong, Xu, Jiaming, and Yu, Sophie H.
Subjects: Mathematics - Statistics Theory, Mathematics - Combinatorics, Mathematics - Probability, Statistics - Machine Learning
Abstract: We study the problem of detecting the edge correlation between two random graphs with $n$ unlabeled nodes. This is formalized as a hypothesis testing problem, where under the null hypothesis, the two graphs are independently generated; under the alternative, the two graphs are edge-correlated under some latent node correspondence, but have the same marginal distributions as the null. For both Gaussian-weighted complete graphs and dense Erd\H{o}s-R\'enyi graphs (with edge probability $n^{-o(1)}$), we determine the sharp threshold at which the optimal testing error probability exhibits a phase transition from zero to one as $n\to \infty$. For sparse Erd\H{o}s-R\'enyi graphs with edge probability $n^{-\Omega(1)}$, we determine the threshold within a constant factor. The proof of the impossibility results is an application of the conditional second-moment method, where we bound the truncated second moment of the likelihood ratio by carefully conditioning on the typical behavior of the intersection graph (consisting of edges in both observed graphs) and taking into account the cycle structure of the induced random permutation on the edges. Notably, in the sparse regime, this is accomplished by leveraging the pseudoforest structure of subcritical Erd\H{o}s-R\'enyi graphs and a careful enumeration of subpseudoforests that can be assembled from short orbits of the edge permutation.
Published: 2020

28. Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

Author: Shi, Jing, Chang, Xuankai, Guo, Pengcheng, Watanabe, Shinji, Fujita, Yusuke, Xu, Jiaming, Xu, Bo, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Neural sequence-to-sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence. In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. We extend the standard sequence-to-sequence model to a conditional multi-sequence model, which explicitly models the relevance between multiple output sequences with the probabilistic chain rule. Based on this extension, our model can conditionally infer output sequences one-by-one by making use of both input and previously-estimated contextual output sequences. This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences. We take speech data as a primary test field to evaluate our methods since the observed speech data is often composed of multiple sources due to the nature of the superposition principle of sound waves. Experiments on several different tasks including speech separation and multi-speaker speech recognition show that our conditional multi-sequence models lead to consistent improvements over the conventional non-conditional models., Comment: 15 pages, 5 figures
Published: 2020

29. Speaker-Conditional Chain Model for Speech Separation and Extraction

Author: Shi, Jing, Xu, Jiaming, Fujita, Yusuke, Watanabe, Shinji, and Xu, Bo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Speech separation has been extensively explored to tackle the cocktail party problem. However, these studies are still far from having enough generalization capabilities for real scenarios. In this work, we raise a common strategy named Speaker-Conditional Chain Model to process complex speech recordings. In the proposed method, our model first infers the identities of variable numbers of speakers from the observation based on a sequence-to-sequence model. Then, it takes the information from the inferred speakers as conditions to extract their speech sources. With the predicted speaker information from whole observation, our model is helpful to solve the problem of conventional speech separation and speaker extraction for multi-round long recordings. The experiments from standard fully-overlapped speech separation benchmarks show comparable results with prior studies, while our proposed model gets better adaptability for multi-round long recordings., Comment: 7pages, 3 figures
Published: 2020

30. Graph Matching with Partially-Correct Seeds

Author: Yu, Liren, Xu, Jiaming, and Lin, Xiaojun
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Discrete Mathematics, Statistics - Machine Learning
Abstract: Graph matching aims to find the latent vertex correspondence between two edge-correlated graphs and has found numerous applications across different fields. In this paper, we study a seeded graph matching problem, which assumes that a set of seeds, i.e., pre-mapped vertex-pairs, is given in advance. While most previous work requires all seeds to be correct, we focus on the setting where the seeds are partially correct. Specifically, consider two correlated graphs whose edges are sampled independently from a parent \ER graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown $\beta$ fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $2$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results in \cite{10.14778/2794367.2794371,lubars2018correcting} when $p\ge n^{-5/6}$. For instance, when $p$ is a constant or $p=n^{-3/4}$, we show that only $\Omega(\sqrt{n\log n})$ correct seeds suffice for perfect matching, while the previously best-known results demand $\Omega(n)$ and $\Omega(n^{3/4}\log n)$ correct seeds, respectively. Numerical experiments corroborate our theoretical findings, demonstrating the superiority of our $2$-hop algorithm on a variety of synthetic and real graphs., Comment: 43 pages, 15 figures
Published: 2020

31. All-or-Nothing Phenomena: From Single-Letter to High Dimensions

Author: Reeves, Galen, Xu, Jiaming, and Zadik, Ilias
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory
Abstract: We consider the linear regression problem of estimating a $p$-dimensional vector $\beta$ from $n$ observations $Y = X \beta + W$, where $\beta_j \stackrel{\text{i.i.d.}}{\sim} \pi$ for a real-valued distribution $\pi$ with zero mean and unit variance, $X_{ij} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and $W_i\stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. In the asymptotic regime where $n/p \to \delta$ and $ p/ \sigma^2 \to \mathsf{snr}$ for two fixed constants $\delta, \mathsf{snr}\in (0, \infty)$ as $p \to \infty$, the limiting (normalized) minimum mean-squared error (MMSE) has been characterized by the MMSE of an associated single-letter (additive Gaussian scalar) channel. In this paper, we show that if the MMSE function of the single-letter channel converges to a step function, then the limiting MMSE of estimating $\beta$ in the linear regression problem converges to a step function which jumps from $1$ to $0$ at a critical threshold. Moreover, we establish that the limiting mean-squared error of the (MSE-optimal) approximate message passing algorithm also converges to a step function with a larger threshold, providing evidence for the presence of a computational-statistical gap between the two thresholds., Comment: This paper was presented at the 2019 IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Guadeloupe, French West Indies
Published: 2019

32. The Planted Matching Problem: Phase Transitions and Exact Results

Author: Moharrami, Mehrdad, Moore, Cristopher, and Xu, Jiaming
Subjects: Computer Science - Data Structures and Algorithms, Condensed Matter - Statistical Mechanics, Mathematics - Combinatorics
Abstract: We study the problem of recovering a planted matching in randomly weighted complete bipartite graphs $K_{n,n}$. For some unknown perfect matching $M^*$, the weight of an edge is drawn from one distribution $P$ if $e \in M^*$ and another distribution $Q$ if $e \notin M^*$. Our goal is to infer $M^*$, exactly or approximately, from the edge weights. In this paper we take $P=\exp(\lambda)$ and $Q=\exp(1/n)$, in which case the maximum-likelihood estimator of $M^*$ is the minimum-weight matching $M_{\text{min}}$. We obtain precise results on the overlap between $M^*$ and $M_{\text{min}}$, i.e., the fraction of edges they have in common. For $\lambda \ge 4$ we have almost perfect recovery, with overlap $1-o(1)$ with high probability. For $\lambda < 4$ the expected overlap is an explicit function $\alpha(\lambda) < 1$: we compute it by generalizing Aldous' celebrated proof of the $\zeta(2)$ conjecture for the un-planted model, using local weak convergence to relate $K_{n,n}$ to a type of weighted infinite tree, and then deriving a system of differential equations from a message-passing algorithm on this tree.
Published: 2019

33. DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

Author: Chen, Feilong, Meng, Fandong, Xu, Jiaming, Li, Peng, Xu, Bo, and Zhou, Jie
Subjects: Computer Science - Computation and Language
Abstract: Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image. It remains a challenging task since it requires the agent to fully understand a given question before making an appropriate response not only from the textual dialog history, but also from the visually-grounded information. While previous models typically leverage single-hop reasoning or single-channel reasoning to deal with this complex multimodal reasoning task, which is intuitively insufficient. In this paper, we thus propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning. Specifically, DMRM maintains a dual channel to obtain the question- and history-aware image features and the question- and image-aware dialog history features by a mulit-hop reasoning process in each channel. Additionally, we also design an effective multimodal attention to further enhance the decoder to generate more accurate responses. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that the proposed model is effective and outperforms compared models by a significant margin., Comment: Accepted at AAAI 2020
Published: 2019

34. Consistent recovery threshold of hidden nearest neighbor graphs

Author: Ding, Jian, Wu, Yihong, Xu, Jiaming, and Yang, Dana
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Computer Science - Social and Information Networks, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: Motivated by applications such as discovering strong ties in social networks and assembling genome subsequences in biology, we study the problem of recovering a hidden $2k$-nearest neighbor (NN) graph in an $n$-vertex complete graph, whose edge weights are independent and distributed according to $P_n$ for edges in the hidden $2k$-NN graph and $Q_n$ otherwise. The special case of Bernoulli distributions corresponds to a variant of the Watts-Strogatz small-world graph. We focus on two types of asymptotic recovery guarantees as $n\to \infty$: (1) exact recovery: all edges are classified correctly with probability tending to one; (2) almost exact recovery: the expected number of misclassified edges is $o(nk)$. We show that the maximum likelihood estimator achieves (1) exact recovery for $2 \le k \le n^{o(1)}$ if $ \liminf \frac{2\alpha_n}{\log n}>1$; (2) almost exact recovery for $ 1 \le k \le o\left( \frac{\log n}{\log \log n} \right)$ if $\liminf \frac{kD(P_n||Q_n)}{\log n}>1$, where $\alpha_n \triangleq -2 \log \int \sqrt{d P_n d Q_n}$ is the R\'enyi divergence of order $\frac{1}{2}$ and $D(P_n||Q_n)$ is the Kullback-Leibler divergence. Under mild distributional assumptions, these conditions are shown to be information-theoretically necessary for any algorithm to succeed. A key challenge in the analysis is the enumeration of $2k$-NN graphs that differ from the hidden one by a given number of edges.
Published: 2019

35. Optimal query complexity for private sequential learning against eavesdropping

Author: Xu, Jiaming, Xu, Kuang, and Yang, Dana
Subjects: Statistics - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value, by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner's queries but not the responses. The learner's goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary even with the complete knowledge of her querying strategy cannot. We develop new querying strategies and analytical techniques and use them to prove tight upper and lower bounds on the optimal query complexity. The bounds almost match across the entire parameter range, substantially improving upon existing results. We thus obtain a complete picture of the optimal query complexity as a function of the estimation accuracy and the desired levels of privacy. We also extend the results to sequential learning models in higher dimensions, and where the binary responses are noisy. Our analysis leverages a crucial insight into the nature of private learning problem, which suggests that the query trajectory of an optimal learner can be divided into distinct phases that focus on pure learning versus learning and obfuscation, respectively.
Published: 2019

36. Spectral Graph Matching and Regularized Quadratic Relaxations II: Erd\H{o}s-R\'enyi Graphs and Universality

Author: Fan, Zhou, Mao, Cheng, Wu, Yihong, and Xu, Jiaming
Subjects: Mathematics - Probability, Computer Science - Machine Learning, Mathematics - Spectral Theory, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We analyze a new spectral graph matching algorithm, GRAph Matching by Pairwise eigen-Alignments (GRAMPA), for recovering the latent vertex correspondence between two unlabeled, edge-correlated weighted graphs. Extending the exact recovery guarantees established in the companion paper for Gaussian weights, in this work, we prove the universality of these guarantees for a general correlated Wigner model. In particular, for two Erd\H{o}s-R\'enyi graphs with edge correlation coefficient $1-\sigma^2$ and average degree at least $\operatorname{polylog}(n)$, we show that GRAMPA exactly recovers the latent vertex correspondence with high probability when $\sigma \lesssim 1/\operatorname{polylog}(n)$. Moreover, we establish a similar guarantee for a variant of GRAMPA, corresponding to a tighter quadratic programming relaxation of the quadratic assignment problem. Our analysis exploits a resolvent representation of the GRAMPA similarity matrix and local laws for the resolvents of sparse Wigner matrices.
Published: 2019

37. Spectral Graph Matching and Regularized Quadratic Relaxations I: The Gaussian Model

Author: Fan, Zhou, Mao, Cheng, Wu, Yihong, and Xu, Jiaming
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Probability, Mathematics - Spectral Theory, Mathematics - Statistics Theory
Abstract: Graph matching aims at finding the vertex correspondence between two unlabeled graphs that maximizes the total edge weight correlation. This amounts to solving a computationally intractable quadratic assignment problem. In this paper we propose a new spectral method, GRAph Matching by Pairwise eigen-Alignments (GRAMPA). Departing from prior spectral approaches that only compare top eigenvectors, or eigenvectors of the same order, GRAMPA first constructs a similarity matrix as a weighted sum of outer products between all pairs of eigenvectors of the two graphs, with weights given by a Cauchy kernel applied to the separation of the corresponding eigenvalues, then outputs a matching by a simple rounding procedure. The similarity matrix can also be interpreted as the solution to a regularized quadratic programming relaxation of the quadratic assignment problem. For the Gaussian Wigner model in which two complete graphs on $n$ vertices have Gaussian edge weights with correlation coefficient $1-\sigma^2$, we show that GRAMPA exactly recovers the correct vertex correspondence with high probability when $\sigma = O(\frac{1}{\log n})$. This matches the state of the art of polynomial-time algorithms, and significantly improves over existing spectral methods which require $\sigma$ to be polynomially small in $n$. The superiority of GRAMPA is also demonstrated on a variety of synthetic and real datasets, in terms of both statistical accuracy and computational efficiency. Universality results, including similar guarantees for dense and sparse Erd\H{o}s-R\'{e}nyi graphs, are deferred to the companion paper.
Published: 2019

38. POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion

Author: Chen, Wen, Huang, Pipei, Xu, Jiaming, Guo, Xin, Guo, Cheng, Sun, Fei, Li, Chao, Pfadler, Andreas, Zhao, Huan, and Zhao, Binqiang
Subjects: Computer Science - Information Retrieval
Abstract: Increasing demand for fashion recommendation raises a lot of challenges for online shopping platforms and fashion communities. In particular, there exist two requirements for fashion outfit recommendation: the Compatibility of the generated fashion outfits, and the Personalization in the recommendation process. In this paper, we demonstrate these two requirements can be satisfied via building a bridge between outfit generation and recommendation. Through large data analysis, we observe that people have similar tastes in individual items and outfits. Therefore, we propose a Personalized Outfit Generation (POG) model, which connects user preferences regarding individual items and outfits with Transformer architecture. Extensive offline and online experiments provide strong quantitative evidence that our method outperforms alternative methods regarding both compatibility and personalization metrics. Furthermore, we deploy POG on a platform named Dida in Alibaba to generate personalized outfits for the users of the online application iFashion. This work represents a first step towards an industrial-scale fashion outfit generation and recommendation solution, which goes beyond generating outfits based on explicit queries, or merely recommending from existing outfit pools. As part of this work, we release a large-scale dataset consisting of 1.01 million outfits with rich context information, and 0.28 billion user click actions from 3.57 million users. To the best of our knowledge, this dataset is the largest, publicly available, fashion related dataset, and the first to provide user behaviors relating to both outfits and fashion items., Comment: Till appear in KDD 2019
Published: 2019

39. The All-or-Nothing Phenomenon in Sparse Linear Regression

Author: Reeves, Galen, Xu, Jiaming, and Zadik, Ilias
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory, Mathematics - Probability
Abstract: We study the problem of recovering a hidden binary $k$-sparse $p$-dimensional vector $\beta$ from $n$ noisy linear observations $Y=X\beta+W$ where $X_{ij}$ are i.i.d. $\mathcal{N}(0,1)$ and $W_i$ are i.i.d. $\mathcal{N}(0,\sigma^2)$. A closely related hypothesis testing problem is to distinguish the pair $(X,Y)$ generated from this structured model from a corresponding null model where $(X,Y)$ consist of purely independent Gaussian entries. In the low sparsity $k=o(p)$ and high signal to noise ratio $k/\sigma^2=\Omega\left(1\right)$ regime, we establish an `All-or-Nothing' information-theoretic phase transition at a critical sample size $n^*=2 k\log \left(p/k\right) /\log \left(1+k/\sigma^2\right)$, resolving a conjecture of \cite{gamarnikzadik}. Specifically, we show that if $\liminf_{p\to \infty} n/n^*>1$, then the maximum likelihood estimator almost perfectly recovers the hidden vector with high probability and moreover the true hypothesis can be detected with a vanishing error probability. Conversely, if $\limsup_{p\to \infty} n/n^*<1$, then it becomes information-theoretically impossible even to recover an arbitrarily small but fixed fraction of the hidden vector support, or to test hypotheses strictly better than random guess. Our proof of the impossibility result builds upon two key techniques, which could be of independent interest. First, we use a conditional second moment method to upper bound the Kullback-Leibler (KL) divergence between the structured and the null model. Second, inspired by the celebrated area theorem, we establish a lower bound to the minimum mean squared estimation error of the hidden vector in terms of the KL divergence between the two models., Comment: 40 pages, 2 figures
Published: 2019

40. Improved queue-size scaling for input-queued switches via graph factorization

Author: Xu, Jiaming and Zhong, Yuan
Subjects: Computer Science - Networking and Internet Architecture
Abstract: This paper studies the scaling of the expected total queue size in an $n\times n$ input-queued switch, as a function of both the load $\rho$ and the system scale $n$. We provide a new class of scheduling policies under which the expected total queue size scales as $O\left( n(1-\rho)^{-4/3} \log \left(\max\{\frac{1}{1-\rho}, n\}\right)\right)$, over all $n$ and $\rho<1$, when the arrival rates are uniform. This improves over the previously best-known scalings in two regimes: $O\left(n^{1.5}(1-\rho)^{-1} \log \frac{1}{1-\rho}\right)$ when $\Omega(n^{-1.5}) \le 1-\rho \le O(n^{-1})$ and $O\left(\frac{n\log n}{(1-\rho)^2}\right)$ when $1-\rho \geq \Omega(n^{-1})$. A key ingredient in our method is a tight characterization of the largest $k$-factor of a random bipartite multigraph, which may be of independent interest., Comment: 42 pages, 4 figures
Published: 2019

41. Efficient random graph matching via degree profiles

Author: Ding, Jian, Ma, Zongming, Wu, Yihong, and Xu, Jiaming
Subjects: Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Computer Science - Machine Learning, Mathematics - Statistics Theory
Abstract: Random graph matching refers to recovering the underlying vertex correspondence between two random graphs with correlated edges; a prominent example is when the two random graphs are given by Erd\H{o}s-R\'{e}nyi graphs $G(n,\frac{d}{n})$. This can be viewed as an average-case and noisy version of the graph isomorphism problem. Under this model, the maximum likelihood estimator is equivalent to solving the intractable quadratic assignment problem. This work develops an $\tilde{O}(n d^2+n^2)$-time algorithm which perfectly recovers the true vertex correspondence with high probability, provided that the average degree is at least $d = \Omega(\log^2 n)$ and the two graphs differ by at most $\delta = O( \log^{-2}(n) )$ fraction of edges. For dense graphs and sparse graphs, this can be improved to $\delta = O( \log^{-2/3}(n) )$ and $\delta = O( \log^{-2}(d) )$ respectively, both in polynomial time. The methodology is based on appropriately chosen distance statistics of the degree profiles (empirical distribution of the degrees of neighbors). Before this work, the best known result achieves $\delta=O(1)$ and $n^{o(1)} \leq d \leq n^c$ for some constant $c$ with an $n^{O(\log n)}$-time algorithm \cite{barak2018nearly} and $\delta=\tilde O((d/n)^4)$ and $d = \tilde{\Omega}(n^{4/5})$ with a polynomial-time algorithm \cite{dai2018performance}., Comment: Proof of Theorem 4 expanded and revised
Published: 2018

42. Concept Learning through Deep Reinforcement Learning with Memory-Augmented Neural Networks

Author: Shi, Jing, Xu, Jiaming, Yao, Yiqun, and Xu, Bo
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Deep neural networks have shown superior performance in many regimes to remember familiar patterns with large amounts of data. However, the standard supervised deep learning paradigm is still limited when facing the need to learn new concepts efficiently from scarce data. In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning. The training procedure, imitating the concept formation course of human, learns how to distinguish samples from different classes and aggregate samples of the same kind. In order to better utilize the advantages originated from the human behavior, we propose a sequential process, during which the network should decide how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model in some typical one-shot learning tasks and also an exploratory outlier detection problem. In all the experiments, our model gets highly competitive to reach or outperform those strong baselines., Comment: 27 pages, 2 figures
Published: 2018
Full Text: View/download PDF

43. Convex Relaxation Methods for Community Detection

Author: Li, Xiaodong, Chen, Yudong, and Xu, Jiaming
Subjects: Mathematics - Statistics Theory, Computer Science - Social and Information Networks, Statistics - Machine Learning
Abstract: This paper surveys recent theoretical advances in convex optimization approaches for community detection. We introduce some important theoretical techniques and results for establishing the consistency of convex community detection under various statistical models. In particular, we discuss the basic techniques based on the primal and dual analysis. We also present results that demonstrate several distinctive advantages of convex community detection, including robustness against outlier nodes, consistency under weak assortativity, and adaptivity to heterogeneous degrees. This survey is not intended to be a complete overview of the vast literature on this fast-growing topic. Instead, we aim to provide a big picture of the remarkable recent development in this area and to make the survey accessible to a broad audience. We hope that this expository article can serve as an introductory guide for readers who are interested in using, designing, and analyzing convex relaxation methods in network analysis., Comment: 22 pages
Published: 2018

44. Cascaded Mutual Modulation for Visual Reasoning

Author: Yao, Yiqun, Xu, Jiaming, Wang, Feng, and Xu, Bo
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both question and image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable textual/visual pipeline to mutually control each other. Experiments show that CMM significantly outperforms most related models, and reach state-of-the-arts on two visual reasoning benchmarks: CLEVR and NLVR, collected from both synthetic and natural languages. Ablation studies confirm that both our multistep framework and our visual-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR., Comment: to appear in EMNLP 2018
Published: 2018

45. Seeded Graph Matching via Large Neighborhood Statistics

Author: Mossel, Elchanan and Xu, Jiaming
Subjects: Computer Science - Machine Learning, Computer Science - Discrete Mathematics, Computer Science - Data Structures and Algorithms, Computer Science - Information Theory, Statistics - Machine Learning
Abstract: We study a well known noisy model of the graph isomorphism problem. In this model, the goal is to perfectly recover the vertex correspondence between two edge-correlated Erd\H{o}s-R\'{e}nyi random graphs, with an initial seed set of correctly matched vertex pairs revealed as side information. For seeded problems, our result provides a significant improvement over previously known results. We show that it is possible to achieve the information-theoretic limit of graph sparsity in time polynomial in the number of vertices $n$. Moreover, we show the number of seeds needed for exact recovery in polynomial-time can be as low as $n^{3\epsilon}$ in the sparse graph regime (with the average degree smaller than $n^{\epsilon}$) and $\Omega(\log n)$ in the dense graph regime. Our results also shed light on the unseeded problem. In particular, we give sub-exponential time algorithms for sparse models and an $n^{O(\log n)}$ algorithm for dense models for some parameters, including some that are not covered by recent results of Barak et al.
Published: 2018

46. Statistical Problems with Planted Structures: Information-Theoretical and Computational Limits

Author: Wu, Yihong and Xu, Jiaming
Subjects: Mathematics - Statistics Theory, Computer Science - Information Theory
Abstract: Over the past few years, insights from computer science, statistical physics, and information theory have revealed phase transitions in a wide array of high-dimensional statistical problems at two distinct thresholds: One is the information-theoretical (IT) threshold below which the observation is too noisy so that inference of the ground truth structure is impossible regardless of the computational cost; the other is the computational threshold above which inference can be performed efficiently, i.e., in time that is polynomial in the input size. In the intermediate regime, inference is information-theoretically possible, but conjectured to be computationally hard. This article provides a survey of the common techniques for determining the sharp IT and computational limits, using community detection and submatrix detection as illustrating examples. For IT limits, we discuss tools including the first and second moment method for analyzing the maximum likelihood estimator, information-theoretic methods for proving impossibility results using mutual information and rate-distortion theory, and methods originated from statistical physics such as interpolation method. To investigate computational limits, we describe a common recipe to construct a randomized polynomial-time reduction scheme that approximately maps instances of the planted clique problem to the problem of interest in total variation distance., Comment: Chapter in "Information-Theoretic Methods in Data Science". Edited by Yonina Eldar and Miguel Rodrigues, Cambridge University Press, forthcoming
Published: 2018

47. Securing Distributed Gradient Descent in High Dimensional Statistical Learning

Author: Su, Lili and Xu, Jiaming
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We consider unreliable distributed learning systems wherein the training data is kept confidential by external workers, and the learner has to interact closely with those workers to train a model. In particular, we assume that there exists a system adversary that can adaptively compromise some workers; the compromised workers deviate from their local designed specifications by sending out arbitrarily malicious messages. We assume in each communication round, up to $q$ out of the $m$ workers suffer Byzantine faults. Each worker keeps a local sample of size $n$ and the total sample size is $N=nm$. We propose a secured variant of the gradient descent method that can tolerate up to a constant fraction of Byzantine workers, i.e., $q/m = O(1)$. Moreover, we show the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, where $d$ is the model dimension. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. Our results are obtained under some technical assumptions. Specifically, we assume strongly-convex population risk. Nevertheless, the empirical risk (sample version) is allowed to be non-convex. The core of our method is to robustly aggregate the gradients computed by the workers based on the filtering procedure proposed by Steinhardt et al. On the technical front, deviating from the existing literature on robustly estimating a finite-dimensional mean vector, we establish a {\em uniform} concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function. To get a near-optimal uniform concentration bound, we develop a new matrix concentration inequality, which might be of independent interest.
Published: 2018

48. Hidden Hamiltonian Cycle Recovery via Linear Programming

Author: Bagaria, Vivek, Ding, Jian, Tse, David, Wu, Yihong, and Xu, Jiaming
Subjects: Computer Science - Discrete Mathematics, Computer Science - Data Structures and Algorithms, Mathematics - Probability, Statistics - Machine Learning
Abstract: We introduce the problem of hidden Hamiltonian cycle recovery, where there is an unknown Hamiltonian cycle in an $n$-vertex complete graph that needs to be inferred from noisy edge measurements. The measurements are independent and distributed according to $\calP_n$ for edges in the cycle and $\calQ_n$ otherwise. This formulation is motivated by a problem in genome assembly, where the goal is to order a set of contigs (genome subsequences) according to their positions on the genome using long-range linking measurements between the contigs. Computing the maximum likelihood estimate in this model reduces to a Traveling Salesman Problem (TSP). Despite the NP-hardness of TSP, we show that a simple linear programming (LP) relaxation, namely the fractional $2$-factor (F2F) LP, recovers the hidden Hamiltonian cycle with high probability as $n \to \infty$ provided that $\alpha_n - \log n \to \infty$, where $\alpha_n \triangleq -2 \log \int \sqrt{d P_n d Q_n}$ is the R\'enyi divergence of order $\frac{1}{2}$. This condition is information-theoretically optimal in the sense that, under mild distributional assumptions, $\alpha_n \geq (1+o(1)) \log n$ is necessary for any algorithm to succeed regardless of the computational cost. Departing from the usual proof techniques based on dual witness construction, the analysis relies on the combinatorial characterization (in particular, the half-integrality) of the extreme points of the F2F polytope. Represented as bicolored multi-graphs, these extreme points are further decomposed into simpler "blossom-type" structures for the large deviation analysis and counting arguments. Evaluation of the algorithm on real data shows improvements over existing approaches.
Published: 2018

49. Rates of Convergence of Spectral Methods for Graphon Estimation

Author: Xu, Jiaming
Subjects: Statistics - Machine Learning, Computer Science - Learning, Computer Science - Social and Information Networks, Mathematics - Statistics Theory
Abstract: This paper studies the problem of estimating the grahpon model - the underlying generating mechanism of a network. Graphon estimation arises in many applications such as predicting missing links in networks and learning user preferences in recommender systems. The graphon model deals with a random graph of $n$ vertices such that each pair of two vertices $i$ and $j$ are connected independently with probability $\rho \times f(x_i,x_j)$, where $x_i$ is the unknown $d$-dimensional label of vertex $i$, $f$ is an unknown symmetric function, and $\rho$ is a scaling parameter characterizing the graph sparsity. Recent studies have identified the minimax error rate of estimating the graphon from a single realization of the random graph. However, there exists a wide gap between the known error rates of computationally efficient estimation procedures and the minimax optimal error rate. Here we analyze a spectral method, namely universal singular value thresholding (USVT) algorithm, in the relatively sparse regime with the average vertex degree $n\rho=\Omega(\log n)$. When $f$ belongs to H\"{o}lder or Sobolev space with smoothness index $\alpha$, we show the error rate of USVT is at most $(n\rho)^{ -2 \alpha / (2\alpha+d)}$, approaching the minimax optimal error rate $\log (n\rho)/(n\rho)$ for $d=1$ as $\alpha$ increases. Furthermore, when $f$ is analytic, we show the error rate of USVT is at most $\log^d (n\rho)/(n\rho)$. In the special case of stochastic block model with $k$ blocks, the error rate of USVT is at most $k/(n\rho)$, which is larger than the minimax optimal error rate by at most a multiplicative factor $k/\log k$. This coincides with the computational gap observed for community detection. A key step of our analysis is to derive the eigenvalue decaying rate of the edge probability matrix using piecewise polynomial approximations of the graphon function $f$.
Published: 2017

50. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

Author: Chen, Yudong, Su, Lili, and Xu, Jiaming
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Cryptography and Security, Computer Science - Learning, Statistics - Machine Learning
Abstract: We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning -- a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.
Published: 2017

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

77 results on '"Xu, JiaMing"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources