{"title": "adaptive and optimal online linear regression on $\\ell^1$-balls", "id": "1105.4042", "abstract": "we consider the problem of online linear regression on individual sequences. the goal in this paper is for the forecaster to output sequential predictions which are, after $t$ time rounds, almost as good as the ones output by the best linear predictor in a given $\\ell^1$-ball in $\\\\r^d$. we consider both the cases where the dimension~$d$ is small and large relative to the time horizon $t$. we first present regret bounds with optimal dependencies on $d$, $t$, and on the sizes $u$, $x$ and $y$ of the $\\ell^1$-ball, the input data and the observations. the minimax regret is shown to exhibit a regime transition around the point $d = \\sqrt{t} u x / (2 y)$. furthermore, we present efficient algorithms that are adaptive, \\ie, that do not require the knowledge of $u$, $x$, $y$, and $t$, but still achieve nearly optimal regret bounds.", "categories": "stat.ml cs.lg math.st stat.th", "doi": "", "created": "2011-05-20", "updated": "2019-01-16", "authors": ["sébastien gerchinovitz", "jia yuan yu"], "affiliation": [], "url": "https://arxiv.org/abs/1105.4042"}
{"title": "high-dimensional feature selection by feature-wise kernelized lasso", "id": "1202.0515", "abstract": "the goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. the least absolute shrinkage and selection operator (lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. in this paper, we consider a feature-wise kernelized lasso for capturing non-linear input-output dependency. we first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures. we then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. the effectiveness of the proposed method is demonstrated through feature selection experiments with thousands of features.", "categories": "stat.ml cs.ai stat.me", "doi": "10.1162/neco_a_00537", "created": "2012-02-02", "updated": "2019-01-03", "authors": ["makoto yamada", "wittawat jitkrittum", "leonid sigal", "eric p. xing", "masashi sugiyama"], "affiliation": [], "url": "https://arxiv.org/abs/1202.0515"}
{"title": "similarity learning for provably accurate sparse linear classification", "id": "1206.6476", "abstract": "in recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest for optimizing distance and similarity functions. most of the state of the art focus on learning mahalanobis distances (requiring to fulfill a constraint of positive semi-definiteness) for use in a local k-nn algorithm. however, no theoretical link is established between the learned metrics and their performance in classification. in this paper, we make use of the formal framework of good similarities introduced by balcan et al. to design an algorithm for learning a non psd linear similarity optimized in a nonlinear feature space, which is then used to build a global linear classifier. we show that our approach has uniform stability and derive a generalization bound on the classification error. experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that (i) it is fast, (ii) robust to overfitting and (iii) produces very sparse classifiers.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2012-06-27", "updated": "", "authors": ["aurelien bellet", "amaury habrard", "marc sebban"], "affiliation": ["university of saint-etienne", "university of saint-etienne", "university of saint-etienne"], "url": "https://arxiv.org/abs/1206.6476"}
{"title": "robustness and generalization for metric learning", "id": "1209.1086", "abstract": "metric learning has attracted a lot of interest over the last decade, but the generalization ability of such methods has not been thoroughly studied. in this paper, we introduce an adaptation of the notion of algorithmic robustness (previously introduced by xu and mannor) that can be used to derive generalization bounds for metric learning. we further show that a weak notion of robustness is in fact a necessary and sufficient condition for a metric learning algorithm to generalize. to illustrate the applicability of the proposed framework, we derive generalization results for a large family of existing metric learning algorithms, including some sparse formulations that are not covered by previous results.", "categories": "cs.lg cs.ai stat.ml", "doi": "10.1016/j.neucom.2014.09.044", "created": "2012-09-05", "updated": "2014-09-29", "authors": ["aurélien bellet", "amaury habrard"], "affiliation": [], "url": "https://arxiv.org/abs/1209.1086"}
{"title": "a survey on metric learning for feature vectors and structured data", "id": "1306.6709", "abstract": "the need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. this has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. this survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. we pay particular attention to mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2013-06-27", "updated": "2014-02-12", "authors": ["aurélien bellet", "amaury habrard", "marc sebban"], "affiliation": [], "url": "https://arxiv.org/abs/1306.6709"}
{"title": "supervised metric learning with generalization guarantees", "id": "1307.4514", "abstract": "the crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions, an area of research known as metric learning. when data consist of feature vectors, a large body of work has focused on learning a mahalanobis distance. less work has been devoted to metric learning from structured objects (such as strings or trees), most of it focusing on optimizing a notion of edit distance. we identify two important limitations of current metric learning approaches. first, they allow to improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not been studied so far. second, the question of the generalization ability of metric learning methods has been largely ignored. in this thesis, we propose theoretical and algorithmic contributions that address these limitations. our first contribution is the derivation of a new kernel function built from learned edit probabilities. our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of (e,g,t)-good similarity functions. using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. in our third contribution, we extend these ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the (e,g,t)-goodness. generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms based on a notion of algorithmic robustness.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2013-07-17", "updated": "2013-07-23", "authors": ["aurélien bellet"], "affiliation": [], "url": "https://arxiv.org/abs/1307.4514"}
{"title": "a distributed frank-wolfe algorithm for communication-efficient sparse learning", "id": "1404.2644", "abstract": "learning sparse combinations is a frequent theme in machine learning. in this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. we address the key challenges of balancing communication costs and optimization errors. to this end, we propose a distributed frank-wolfe (dfw) algorithm. we obtain theoretical guarantees on the optimization error $\\epsilon$ and communication cost that do not depend on the total number of combining elements. we further show that the communication cost of dfw is optimal by deriving a lower-bound on the communication cost required to construct an $\\epsilon$-approximate solution. we validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dfw outperforms both baselines and competing methods. we also study the performance of dfw when the conditions of our analysis are relaxed, and show that dfw is fairly robust.", "categories": "cs.dc cs.ai cs.lg stat.ml", "doi": "", "created": "2014-04-09", "updated": "2015-01-12", "authors": ["aurélien bellet", "yingyu liang", "alireza bagheri garakani", "maria-florina balcan", "fei sha"], "affiliation": [], "url": "https://arxiv.org/abs/1404.2644"}
{"title": "sparse compositional metric learning", "id": "1404.4105", "abstract": "we propose a new approach for metric learning by framing it as learning a sparse combination of locally discriminative metrics that are inexpensive to generate from the training data. this flexible framework allows us to naturally derive formulations for global, multi-task and local metric learning. the resulting algorithms have several advantages over existing methods in the literature: a much smaller number of parameters to be estimated and a principled way to generalize learned metrics to new testing data points. to analyze the approach theoretically, we derive a generalization bound that justifies the sparse combination. empirically, we evaluate our algorithms on several datasets against state-of-the-art metric learning methods. the results are consistent with our theoretical findings and demonstrate the superiority of our approach in terms of classification performance and scalability.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2014-04-15", "updated": "", "authors": ["yuan shi", "aurélien bellet", "fei sha"], "affiliation": [], "url": "https://arxiv.org/abs/1404.4105"}
{"title": "persistent homology in sparse regression and its application to brain morphometry", "id": "1409.0177", "abstract": "sparse systems are usually parameterized by a tuning parameter that determines the sparsity of the system. how to choose the right tuning parameter is a fundamental and difficult problem in learning the sparse system. in this paper, by treating the the tuning parameter as an additional dimension, persistent homological structures over the parameter space is introduced and explored. the structures are then further exploited in speeding up the computation using the proposed soft-thresholding technique. the topological structures are further used as multivariate features in the tensor-based morphometry (tbm) in characterizing white matter alterations in children who have experienced severe early life stress and maltreatment. these analyses reveal that stress-exposed children exhibit more diffuse anatomical organization across the whole white matter region.", "categories": "stat.me cs.cv", "doi": "10.1109/tmi.2015.2416271", "created": "2014-08-30", "updated": "2015-03-09", "authors": ["moo k. chung", "jamie l. hanson", "jieping ye", "richard j. davidson", "seth d. pollak"], "affiliation": [], "url": "https://arxiv.org/abs/1409.0177"}
{"title": "how to scale up kernel methods to be as good as deep neural nets", "id": "1411.4000", "abstract": "the computational complexity of kernel methods has often been a major barrier for applying them to large-scale learning problems. we argue that this barrier can be effectively overcome. in particular, we develop methods to scale up kernel models to successfully tackle large-scale learning problems that are so far only approachable by deep learning architectures. based on the seminal work by rahimi and recht on approximating kernel functions with features derived from random projections, we advance the state-of-the-art by proposing methods that can efficiently train models with hundreds of millions of parameters, and learn optimal representations from multiple kernels. we conduct extensive empirical studies on problems from image recognition and automatic speech recognition, and show that the performance of our kernel models matches that of well-engineered deep neural nets (dnns). to the best of our knowledge, this is the first time that a direct comparison between these two methods on large-scale problems is reported. our kernel methods have several appealing properties: training with convex optimization, cost for training a single model comparable to dnns, and significantly reduced total cost due to fewer hyperparameters to tune for model selection. our contrastive study between these two very different but equally competitive models sheds light on fundamental questions such as how to learn good representations.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2014-11-14", "updated": "2015-06-17", "authors": ["zhiyun lu", "avner may", "kuan liu", "alireza bagheri garakani", "dong guo", "aurélien bellet", "linxi fan", "michael collins", "brian kingsbury", "michael picheny", "fei sha"], "affiliation": [], "url": "https://arxiv.org/abs/1411.4000"}
{"title": "a projection based conditional dependence measure with applications to high-dimensional undirected graphical models", "id": "1501.01617", "abstract": "measuring conditional dependence is an important topic in statistics with broad applications including graphical models. under a factor model setting, a new conditional dependence measure based on projection is proposed. the corresponding conditional independence test is developed with the asymptotic null distribution unveiled where the number of factors could be high-dimensional. it is also shown that the new test has control over the asymptotic significance level and can be calculated efficiently. a generic method for building dependency graphs without gaussian assumption using the new test is elaborated. numerical results and real data analysis show the superiority of the new method.", "categories": "stat.me math.st stat.ap stat.ml stat.th", "doi": "", "created": "2015-01-07", "updated": "2019-01-11", "authors": ["jianqing fan", "yang feng", "lucy xia"], "affiliation": [], "url": "https://arxiv.org/abs/1501.01617"}
{"title": "scaling-up empirical risk minimization: optimization of incomplete u-statistics", "id": "1501.02629", "abstract": "in a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $u$-statistics of degree $d\\geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. from a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $o(n^d)$ terms. this makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. it is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler monte-carlo estimates based on $o(n)$ terms only, usually referred to as incomplete $u$-statistics, without damaging the $o_{\\mathbb{p}}(1/\\sqrt{n})$ learning rate of empirical risk minimization (erm) procedures. for this purpose, we establish uniform deviation results describing the error made when approximating a $u$-process by its incomplete version under appropriate complexity assumptions. extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for erm. finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.", "categories": "stat.ml cs.ai cs.lg", "doi": "", "created": "2015-01-12", "updated": "2016-04-19", "authors": ["stéphan clémençon", "aurélien bellet", "igor colin"], "affiliation": [], "url": "https://arxiv.org/abs/1501.02629"}
{"title": "extreme compressive sampling for covariance estimation", "id": "1506.00898", "abstract": "this paper studies the problem of estimating the covariance of a collection of vectors using only highly compressed measurements of each vector. an estimator based on back-projections of these compressive samples is proposed and analyzed. a distribution-free analysis shows that by observing just a single linear measurement of each vector, one can consistently estimate the covariance matrix, in both infinity and spectral norm, and this same analysis leads to precise rates of convergence in both norms. via information-theoretic techniques, lower bounds showing that this estimator is minimax-optimal for both infinity and spectral norm estimation problems are established. these results are also specialized to give matching upper and lower bounds for estimating the population covariance of a collection of gaussian vectors, again in the compressive measurement model. the analysis conducted in this paper shows that the effective sample complexity for this problem is scaled by a factor of $m^2/d^2$ where $m$ is the compression dimension and $d$ is the ambient dimension. applications to subspace learning (principal components analysis) and learning over distributed sensor networks are also discussed.", "categories": "stat.ml cs.it math.it", "doi": "", "created": "2015-06-02", "updated": "2019-01-14", "authors": ["martin azizyan", "akshay krishnamurthy", "aarti singh"], "affiliation": [], "url": "https://arxiv.org/abs/1506.00898"}
{"title": "measuring sample quality with stein's method", "id": "1506.03039", "abstract": "to improve the efficiency of monte carlo estimation, practitioners are turning to biased markov chain monte carlo procedures that trade off asymptotic exactness for computational speed. the reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. however, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. to address these challenges, we introduce a new computable quality measure based on stein's method that quantifies the maximum discrepancy between sample and target expectations over a large class of test functions. we use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference.", "categories": "stat.ml cs.lg math.pr stat.me", "doi": "", "created": "2015-06-09", "updated": "2018-12-31", "authors": ["jackson gorham", "lester mackey"], "affiliation": [], "url": "https://arxiv.org/abs/1506.03039"}
{"title": "recovering metric from full ordinal information", "id": "1506.03762", "abstract": "given a geodesic space (e, d), we show that full ordinal knowledge on the metric d-i.e. knowledge of the function d d : (w, x, y, z) $\\rightarrow$ 1 d(w,x)$\\le$d(y,z) , determines uniquely-up to a constant factor-the metric d. for a subspace en of n points of e, converging in hausdorff distance to e, we construct a metric dn on en, based only on the knowledge of d d on en and establish a sharp upper bound of the gromov-hausdorff distance between (en, dn) and (e, d).", "categories": "stat.ml math.st stat.th", "doi": "", "created": "2015-06-11", "updated": "2018-12-29", "authors": ["thibaut le gouic"], "affiliation": ["i2m, cs-hse"], "url": "https://arxiv.org/abs/1506.03762"}
{"title": "offline handwritten signature verification - literature review", "id": "1507.07909", "abstract": "the area of handwritten signature verification has been broadly researched in the last decades, but remains an open research problem. the objective of signature verification systems is to discriminate if a given signature is genuine (produced by the claimed individual), or a forgery (produced by an impostor). this has demonstrated to be a challenging task, in particular in the offline (static) scenario, that uses images of scanned signatures, where the dynamic information about the signing process is not available. many advancements have been proposed in the literature in the last 5-10 years, most notably the application of deep learning methods to learn feature representations from signature images. in this paper, we present how the problem has been handled in the past few decades, analyze the recent advancements in the field, and the potential directions for future research.", "categories": "cs.cv stat.ml", "doi": "10.1109/ipta.2017.8310112", "created": "2015-07-28", "updated": "2017-10-16", "authors": ["luiz g. hafemann", "robert sabourin", "luiz s. oliveira"], "affiliation": [], "url": "https://arxiv.org/abs/1507.07909"}
{"title": "extending gossip algorithms to distributed estimation of u-statistics", "id": "1511.05464", "abstract": "efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of $u$-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. yet, such data functionals are essential to describe global properties of a statistical population, with important examples including area under the curve, empirical variance, gini mean difference and within-cluster point scatter. this paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the $u$-statistic of interest. we establish convergence rate bounds of $o(1/t)$ and $o(\\log t / t)$ for the synchronous and asynchronous cases respectively, where $t$ is the number of iterations, with explicit data and network dependent terms. beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence the proposed algorithms surpasses the previously introduced approach.", "categories": "stat.ml cs.dc cs.lg cs.sy stat.co", "doi": "", "created": "2015-11-17", "updated": "", "authors": ["igor colin", "aurélien bellet", "joseph salmon", "stéphan clémençon"], "affiliation": [], "url": "https://arxiv.org/abs/1511.05464"}
{"title": "sparse convex clustering", "id": "1601.04586", "abstract": "convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn recent attentions since it nicely addresses the instability issue of traditional nonconvex clustering methods. although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scenario, where the data contains a large number of features and many of them carry no information about the clustering structure. in this paper, we demonstrate that the performance of convex clustering could be distorted when the uninformative features are included in the clustering. to overcome it, we introduce a new clustering method, referred to as sparse convex clustering, to simultaneously cluster observations and conduct feature selection. the key idea is to formulate convex clustering in a form of regularization, with an adaptive group-lasso penalty term on cluster centers. in order to optimally balance the tradeoff between the cluster fitting and sparsity, a tuning criterion based on clustering stability is developed. in theory, we provide an unbiased estimator for the degrees of freedom of the proposed sparse convex clustering method. finally, the effectiveness of the sparse convex clustering is examined through a variety of numerical experiments and a real data application.", "categories": "stat.me cs.lg stat.ml", "doi": "10.1080/10618600.2017.1377081", "created": "2016-01-18", "updated": "2017-02-10", "authors": ["binhuan wang", "yilong zhang", "will wei sun", "yixin fang"], "affiliation": [], "url": "https://arxiv.org/abs/1601.04586"}
{"title": "accelerated randomized mirror descent algorithms for composite non-strongly convex optimization", "id": "1605.06892", "abstract": "we consider the problem of minimizing the sum of an average function of a large number of smooth convex components and a general, possibly non-differentiable, convex function. although many methods have been proposed to solve this problem with the assumption that the sum is strongly convex, few methods support the non-strongly convex case. adding a small quadratic regularization is a common devise used to tackle non-strongly convex problems; however, it may cause loss of sparsity of solutions or weaken the performance of the algorithms. avoiding this devise, we propose an accelerated randomized mirror descent method for solving this problem without the strongly convex assumption. our method extends the deterministic accelerated proximal gradient methods of paul tseng and can be applied even when proximal points are computed inexactly. we also propose a scheme for solving the problem when the component functions are non-smooth.", "categories": "math.oc stat.ml", "doi": "", "created": "2016-05-23", "updated": "2018-12-31", "authors": ["le thi khanh hien", "cuong v. nguyen", "huan xu", "canyi lu", "jiashi feng"], "affiliation": [], "url": "https://arxiv.org/abs/1605.06892"}
{"title": "differentially private gaussian processes", "id": "1606.00720", "abstract": "a major challenge for machine learning is increasing the availability of data while respecting the privacy of individuals. here we combine the provable privacy guarantees of the differential privacy framework with the flexibility of gaussian processes (gps). we propose a method using gps to provide differentially private (dp) regression. we then improve this method by crafting the dp noise covariance structure to efficiently protect the training data, while minimising the scale of the added noise. we find that this cloaking method achieves the greatest accuracy, while still providing privacy guarantees, and offers practical dp for regression over multi-dimensional inputs. together these methods provide a starter toolkit for combining differential privacy and gps.", "categories": "stat.ml cs.lg", "doi": "", "created": "2016-06-02", "updated": "2019-01-17", "authors": ["michael thomas smith", "max zwiessele", "neil d. lawrence"], "affiliation": [], "url": "https://arxiv.org/abs/1606.00720"}
{"title": "gossip dual averaging for decentralized optimization of pairwise functions", "id": "1606.02421", "abstract": "in decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. in this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. this general problem finds applications in ranking, distance metric learning and graph inference, among others. we propose new gossip algorithms based on dual averaging which aims at solving such problems both in synchronous and asynchronous settings. the proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. we present numerical simulations on area under the roc curve (auc) maximization and metric learning problems which illustrate the practical interest of our approach.", "categories": "stat.ml cs.ai cs.dc cs.lg cs.sy", "doi": "", "created": "2016-06-08", "updated": "", "authors": ["igor colin", "aurélien bellet", "joseph salmon", "stéphan clémençon"], "affiliation": [], "url": "https://arxiv.org/abs/1606.02421"}
{"title": "multi-view kernel consensus for data analysis", "id": "1606.08819", "abstract": "the input data features set for many data driven tasks is high-dimensional while the intrinsic dimension of the data is low. data analysis methods aim to uncover the underlying low dimensional structure imposed by the low dimensional hidden parameters by utilizing distance metrics that consider the set of attributes as a single monolithic set. however, the transformation of the low dimensional phenomena into the measured high dimensional observations might distort the distance metric, this distortion can effect the desired estimated low dimensional geometric structure. in this paper, we suggest to utilize the redundancy in the attribute domain by partitioning the attributes into multiple subsets we call views. the proposed methods utilize the agreement also called consensus between different views to extract valuable geometric information that unifies multiple views about the intrinsic relationships among several different observations. this unification enhances the information that a single view or a simple concatenations of views provides.", "categories": "cs.lg stat.ml", "doi": "", "created": "2016-06-28", "updated": "2019-01-29", "authors": ["moshe salhov", "ofir lindenbaum", "yariv aizenbud", "avi silberschatz", "yoel shkolnisky", "amir averbuch"], "affiliation": [], "url": "https://arxiv.org/abs/1606.08819"}
{"title": "iterative hard thresholding for model selection in genome-wide association studies", "id": "1608.01398", "abstract": "a genome-wide association study (gwas) correlates marker variation with trait variation in a sample of individuals. each study subject is genotyped at a multitude of snps (single nucleotide polymorphisms) spanning the genome. here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. over the past decade, researchers have been remarkably successful in applying gwas analysis to hundreds of traits. the massive amount of data produced in these studies present unique computational challenges. penalized regression with lasso or mcp penalties is capable of selecting a handful of associated snps from millions of potential snps. unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. this paper introduces the iterative hard thresholding (iht) algorithm to the gwas analysis of continuous traits. our parallel implementation of iht accommodates snp genotype compression and exploits multiple cpu cores and graphics processing units (gpus). this allows statistical geneticists to leverage commodity desktop computers in gwas analysis and to avoid supercomputing. we evaluate iht performance on both simulated and real gwas data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression. source code is freely available at https://github.com/klkeys/iht.jl.", "categories": "stat.ml", "doi": "10.1002/gepi.22068", "created": "2016-08-03", "updated": "2017-07-24", "authors": ["kevin l. keys", "gary k. chen", "kenneth lange"], "affiliation": [], "url": "https://arxiv.org/abs/1608.01398"}
{"title": "extracting replicable associations across multiple studies: algorithms for controlling the false discovery rate", "id": "1609.01118", "abstract": "extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. here, we consider an extension of efron's single-study two-groups model to allow joint analysis of multiple studies. we assume that given a set of p-values obtained from each study, the researcher is interested in associations that recur in at least $k>1$ studies. we propose new algorithms that differ in how the study dependencies are modeled. we compared our new methods and others using various simulated scenarios. the top performing algorithm, screen (scalable cluster-based replicability enhancement), is our new algorithm that is based on three stages: (1) clustering an estimated correlation network of the studies, (2) learning replicability (e.g., of genes) within clusters, and (3) merging the results across the clusters using dynamic programming. we applied screen to two real datasets and demonstrated that it greatly outperforms the results obtained via standard meta-analysis. first, on a collection of 29 case-control large-scale gene expression cancer studies, we detected a large up-regulated module of genes related to proliferation and cell cycle regulation. these genes are both consistently up-regulated across many cancer studies, and are well connected in known gene networks. second, on a recent pan-cancer study that examined the expression profiles of patients with or without mutations in the hla complex, we detected an active module of up-regulated genes that are related to immune responses. thanks to our ability to quantify the false discovery rate, we detected thrice more genes as compared to the original study. our module contains most of the genes reported in the original study, and many new ones. interestingly, the newly discovered genes are needed to establish the connectivity of the module.", "categories": "stat.me q-bio.gn stat.co", "doi": "10.1371/journal.pcbi.1005700", "created": "2016-09-05", "updated": "2016-09-07", "authors": ["david amar", "ron shamir", "daniel yekutieli"], "affiliation": [], "url": "https://arxiv.org/abs/1609.01118"}
{"title": "informative planning and online learning with sparse gaussian processes", "id": "1609.07560", "abstract": "a big challenge in environmental monitoring is the spatiotemporal variation of the phenomena to be observed. to enable persistent sensing and estimation in such a setting, it is beneficial to have a time-varying underlying environmental model. here we present a planning and learning method that enables an autonomous marine vehicle to perform persistent ocean monitoring tasks by learning and refining an environmental model. to alleviate the computational bottleneck caused by large-scale data accumulated, we propose a framework that iterates between a planning component aimed at collecting the most information-rich data, and a sparse gaussian process learning component where the environmental model and hyperparameters are learned online by taking advantage of only a subset of data that provides the greatest contribution. our simulations with ground-truth ocean data shows that the proposed method is both accurate and efficient.", "categories": "cs.ro cs.ai cs.lg stat.ml", "doi": "", "created": "2016-09-23", "updated": "", "authors": ["kai-chieh ma", "lantao liu", "gaurav s. sukhatme"], "affiliation": [], "url": "https://arxiv.org/abs/1609.07560"}
{"title": "an inexact variable metric proximal point algorithm for generic quasi-newton acceleration", "id": "1610.00960", "abstract": "we propose an inexact variable-metric proximal point algorithm to accelerate gradient-based optimization algorithms. the proposed scheme, called qning can be notably applied to incremental first-order methods such as the stochastic variance-reduced gradient descent algorithm (svrg) and other randomized incremental optimization algorithms. qning is also compatible with composite objectives, meaning that it has the ability to provide exactly sparse solutions when the objective involves a sparsity-inducing regularization. when combined with limited-memory bfgs rules, qning is particularly effective to solve high-dimensional optimization problems, while enjoying a worst-case linear convergence rate for strongly convex problems. we present experimental results where qning gives significant improvements over competing methods for training machine learning methods on large samples and in high dimensions.", "categories": "stat.ml math.oc", "doi": "", "created": "2016-10-04", "updated": "2019-01-29", "authors": ["hongzhou lin", "julien mairal", "zaid harchaoui"], "affiliation": [], "url": "https://arxiv.org/abs/1610.00960"}
{"title": "indirect gaussian graph learning beyond gaussianity", "id": "1610.02590", "abstract": "this paper studies how to capture dependency graph structures from real data which may not be multivariate gaussian. starting from marginal loss functions not necessarily derived from probability distributions, we utilize an additive over-parametrization with shrinkage to incorporate variable dependencies into the criterion. an iterative gaussian graph learning algorithm is proposed with ease in implementation. statistical analysis shows that the estimators achieve satisfactory accuracy with the error measured in terms of a proper bregman divergence. real-life examples in different settings are given to demonstrate the efficacy of the proposed methodology.", "categories": "stat.ml stat.me", "doi": "", "created": "2016-10-08", "updated": "2019-01-12", "authors": ["yiyuan she", "shao tang", "qiaoya zhang"], "affiliation": [], "url": "https://arxiv.org/abs/1610.02590"}
{"title": "learning from survey training samples: rate bounds for horvitz-thompson risk minimizers", "id": "1610.03316", "abstract": "the generalization ability of minimizers of the empirical risk in the context of binary classification has been investigated under a wide variety of complexity assumptions for the collection of classifiers over which optimization is performed. in contrast, the vast majority of the works dedicated to this issue stipulate that the training dataset used to compute the empirical risk functional is composed of i.i.d. observations. beyond the cases where training data are drawn uniformly without replacement among a large i.i.d. sample or modelled as a realization of a weakly dependent sequence of r.v.'s, statistical guarantees when the data used to train a classifier are drawn by means of a more general sampling/survey scheme and exhibit a complex dependence structure have not been documented yet. it is the main purpose of this paper to show that the theory of empirical risk minimization can be extended to situations where statistical learning is based on survey samples and knowledge of the related inclusion probabilities. precisely, we prove that minimizing a weighted version of the empirical risk, refered to as the horvitz-thompson risk (ht risk), over a class of controlled complexity lead to a rate for the excess risk of the order $o_{\\mathbb{p}}((\\kappa_n (\\log n)/n)^{1/2})$ with $\\kappa_n=(n/n)/\\min_{i\\leq n}\\pi_i$, when data are sampled by means of a rejective scheme of (deterministic) size $n$ within a statistical population of cardinality $n\\geq n$, a generalization of basic {\\it sampling without replacement} with unequal probability weights $\\pi_i>0$. extension to other sampling schemes are then established by a coupling argument. beyond theoretical results, numerical experiments are displayed in order to show the relevance of ht risk minimization and that ignoring the sampling scheme used to generate the training dataset may completely jeopardize the learning procedure.", "categories": "math.st stat.th", "doi": "", "created": "2016-10-11", "updated": "2019-01-18", "authors": ["clémençon stephan", "patrice bertail", "guillaume papa"], "affiliation": [], "url": "https://arxiv.org/abs/1610.03316"}
{"title": "decentralized collaborative learning of personalized models over networks", "id": "1610.05202", "abstract": "we consider a set of learning agents in a collaborative peer-to-peer network, where each agent learns a personalized model according to its own learning objective. the question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? we introduce and analyze two asynchronous gossip algorithms running in a fully decentralized manner. our first approach, inspired from label propagation, aims to smooth pre-trained local models over the network while accounting for the confidence that each agent has in its initial model. in our second approach, agents jointly learn and propagate their model by making iterative updates based on both their local dataset and the behavior of their neighbors. to optimize this challenging objective, our decentralized algorithm is based on admm.", "categories": "cs.lg cs.ai cs.dc cs.sy stat.ml", "doi": "", "created": "2016-10-17", "updated": "2017-02-15", "authors": ["paul vanhaesebrouck", "aurélien bellet", "marc tommasi"], "affiliation": [], "url": "https://arxiv.org/abs/1610.05202"}
{"title": "a theoretical analysis of noisy sparse subspace clustering on dimensionality-reduced data", "id": "1610.07650", "abstract": "subspace clustering is the problem of partitioning unlabeled data points into a number of clusters so that data points within one cluster lie approximately on a low-dimensional linear subspace. in many practical scenarios, the dimensionality of data points to be clustered are compressed due to constraints of measurement, computation or privacy. in this paper, we study the theoretical properties of a popular subspace clustering algorithm named sparse subspace clustering (ssc) and establish formal success conditions of ssc on dimensionality-reduced data. our analysis applies to the most general fully deterministic model where both underlying subspaces and data points within each subspace are deterministically positioned, and also a wide range of dimensionality reduction techniques (e.g., gaussian random projection, uniform subsampling, sketching) that fall into a subspace embedding framework (meng & mahoney, 2013; avron et al., 2014). finally, we apply our analysis to a differentially private ssc algorithm and established both privacy and utility guarantees of the proposed method.", "categories": "stat.ml cs.lg", "doi": "", "created": "2016-10-24", "updated": "", "authors": ["yining wang", "yu-xiang wang", "aarti singh"], "affiliation": [], "url": "https://arxiv.org/abs/1610.07650"}
{"title": "adversarial influence maximization", "id": "1611.00350", "abstract": "we consider the problem of influence maximization in fixed networks for contagion models in an adversarial setting. the goal is to select an optimal set of nodes to seed the influence process, such that the number of influenced nodes at the conclusion of the campaign is as large as possible. we formulate the problem as a repeated game between a player and adversary, where the adversary specifies the edges along which the contagion may spread, and the player chooses sets of nodes to influence in an online fashion. we establish upper and lower bounds on the minimax pseudo-regret in both undirected and directed networks.", "categories": "cs.si cs.lg stat.ml", "doi": "", "created": "2016-11-01", "updated": "2019-01-19", "authors": ["justin khim", "varun jog", "po-ling loh"], "affiliation": [], "url": "https://arxiv.org/abs/1611.00350"}
{"title": "practical heteroskedastic gaussian process modeling for large simulation experiments", "id": "1611.05902", "abstract": "we present a unified view of likelihood based gaussian progress regression for simulation experiments exhibiting input-dependent noise. replication plays an important role in that context, however previous methods leveraging replicates have either ignored the computational savings that come from such design, or have short-cut full likelihood-based inference to remain tractable. starting with homoskedastic processes, we show how multiple applications of a well-known woodbury identity facilitate inference for all parameters under the likelihood (without approximation), bypassing the typical full-data sized calculations. we then borrow a latent-variable idea from machine learning to address heteroskedasticity, adapting it to work within the same thrifty inferential framework, thereby simultaneously leveraging the computational and statistical efficiency of designs with replication. the result is an inferential scheme that can be characterized as single objective function, complete with closed form derivatives, for rapid library-based optimization. illustrations are provided, including real-world simulation experiments from manufacturing and the management of epidemics.", "categories": "stat.me stat.co", "doi": "10.1080/10618600.2018.1458625", "created": "2016-11-17", "updated": "2017-11-13", "authors": ["mickael binois", "robert b. gramacy", "michael ludkovski"], "affiliation": [], "url": "https://arxiv.org/abs/1611.05902"}
{"title": "a convex program for mixed linear regression with a recovery guarantee for well-separated data", "id": "1612.06067", "abstract": "we introduce a convex approach for mixed linear regression over $d$ features. this approach is a second-order cone program, based on l1 minimization, which assigns an estimate regression coefficient in $\\mathbb{r}^{d}$ for each data point. these estimates can then be clustered using, for example, $k$-means. for problems with two or more mixture classes, we prove that the convex program exactly recovers all of the mixture components in the noiseless setting under technical conditions that include a well-separation assumption on the data. under these assumptions, recovery is possible if each class has at least $d$ independent measurements. we also explore an iteratively reweighted least squares implementation of this method on real and synthetic data.", "categories": "math.oc stat.ml", "doi": "10.1093/imaiai/iax018", "created": "2016-12-19", "updated": "2017-12-21", "authors": ["paul hand", "babhru joshi"], "affiliation": [], "url": "https://arxiv.org/abs/1612.06067"}
{"title": "computing human-understandable strategies", "id": "1612.06340", "abstract": "algorithms for equilibrium computation generally make no attempt to ensure that the computed strategies are understandable by humans. for instance the strategies for the strongest poker agents are represented as massive binary files. in many situations, we would like to compute strategies that can actually be implemented by humans, who may have computational limitations and may only be able to remember a small number of features or components of the strategies that have been computed. we study poker games where private information distributions can be arbitrary. we create a large training set of game instances and solutions, by randomly selecting the information probabilities, and present algorithms that learn from the training instances in order to perform well in games with unseen information distributions. we are able to conclude several new fundamental rules about poker strategy that can be easily implemented by humans.", "categories": "cs.gt cs.ai cs.lg cs.ma stat.ml", "doi": "", "created": "2016-12-19", "updated": "2017-02-20", "authors": ["sam ganzfried", "farzana yusuf"], "affiliation": [], "url": "https://arxiv.org/abs/1612.06340"}
{"title": "the pessimistic limits and possibilities of margin-based losses in semi-supervised learning", "id": "1612.08875", "abstract": "consider a classification problem where we have both labeled and unlabeled data available. we show that for linear classifiers defined by convex margin-based surrogate losses that are decreasing, it is impossible to construct any semi-supervised approach that is able to guarantee an improvement over the supervised classifier measured by this surrogate loss on the labeled and unlabeled data. for convex margin-based loss functions that also increase, we demonstrate safe improvements are possible.", "categories": "stat.ml cs.lg", "doi": "", "created": "2016-12-28", "updated": "2019-01-08", "authors": ["jesse h. krijthe", "marco loog"], "affiliation": [], "url": "https://arxiv.org/abs/1612.08875"}
{"title": "kernel approximation methods for speech recognition", "id": "1701.03577", "abstract": "we study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (dnns). we perform experiments on four speech recognition datasets, including the timit and broadcast news benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). in order to scale kernel methods to these large datasets, we use the random fourier feature method of rahimi and recht (2007). we propose two novel techniques for improving the performance of kernel acoustic models. first, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. the method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. this technique can noticeably improve the recognition performance of both dnn and kernel models, while narrowing the gap between them. additionally, we show that the linear bottleneck method of sainath et al. (2013) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to dnns on the tasks we explored.", "categories": "stat.ml cs.ai cs.cl cs.lg", "doi": "", "created": "2017-01-13", "updated": "", "authors": ["avner may", "alireza bagheri garakani", "zhiyun lu", "dong guo", "kuan liu", "aurélien bellet", "linxi fan", "michael collins", "daniel hsu", "brian kingsbury", "michael picheny", "fei sha"], "affiliation": [], "url": "https://arxiv.org/abs/1701.03577"}
{"title": "stochastic graphlet embedding", "id": "1702.00156", "abstract": "graph-based methods are known to be successful in many machine learning and pattern classification tasks. these methods consider semi-structured data as graphs where nodes correspond to primitives (parts, interest points, segments, etc.) and edges characterize the relationships between these primitives. however, these non-vectorial graph data cannot be straightforwardly plugged into off-the-shelf machine learning algorithms without a preliminary step of -- explicit/implicit -- graph vectorization and embedding. this embedding process should be resilient to intra-class graph variations while being highly discriminant. in this paper, we propose a novel high-order stochastic graphlet embedding (sge) that maps graphs into vector spaces. our main contribution includes a new stochastic search procedure that efficiently parses a given graph and extracts/samples unlimitedly high-order graphlets. we consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. in order to build our graph representation, we measure the distribution of these graphlets into a given graph, using particular hash functions that efficiently assign sampled graphlets into isomorphic sets with a very low probability of collision. when combined with maximum margin classifiers, these graphlet-based representations have positive impact on the performance of pattern comparison and recognition as corroborated through extensive experiments using standard benchmark databases.", "categories": "cs.cv cs.lg stat.ml", "doi": "10.1109/tnnls.2018.2884700", "created": "2017-02-01", "updated": "2018-12-04", "authors": ["anjan dutta", "hichem sahbi"], "affiliation": [], "url": "https://arxiv.org/abs/1702.00156"}
{"title": "predicting pairwise relations with neural similarity encoders", "id": "1702.01824", "abstract": "matrix factorization is at the heart of many machine learning algorithms, for example, dimensionality reduction (e.g. kernel pca) or recommender systems relying on collaborative filtering. understanding a singular value decomposition (svd) of a matrix as a neural network optimization problem enables us to decompose large matrices efficiently while dealing naturally with missing values in the given matrix. but most importantly, it allows us to learn the connection between data points' feature vectors and the matrix containing information about their pairwise relations. in this paper we introduce a novel neural network architecture termed similarity encoder (simec), which is designed to simultaneously factorize a given target matrix while also learning the mapping to project the data points' feature vectors into a similarity preserving embedding space. this makes it possible to, for example, easily compute out-of-sample solutions for new data points. additionally, we demonstrate that simec can preserve non-metric similarities and even predict multiple pairwise relations between data points at once.", "categories": "stat.ml cs.lg", "doi": "10.24425/bpas.2018.125929", "created": "2017-02-06", "updated": "2019-01-09", "authors": ["franziska horn", "klaus-robert müller"], "affiliation": [], "url": "https://arxiv.org/abs/1702.01824"}
{"title": "low rank matrix recovery with simultaneous presence of outliers and sparse corruption", "id": "1702.01847", "abstract": "we study a data model in which the data matrix d can be expressed as d = l + s + c, where l is a low rank matrix, s an element-wise sparse matrix and c a matrix whose non-zero columns are outlying data points. to date, robust pca algorithms have solely considered models with either s or c, but not both. as such, existing algorithms cannot account for simultaneous element-wise and column-wise corruptions. in this paper, a new robust pca algorithm that is robust to simultaneous types of corruption is proposed. our approach hinges on the sparse approximation of a sparsely corrupted column so that the sparse expansion of a column with respect to the other data points is used to distinguish a sparsely corrupted inlier column from an outlying data point. we also develop a randomized design which provides a scalable implementation of the proposed approach. the core idea of sparse approximation is analyzed analytically where we show that the underlying ell_1-norm minimization can obtain the representation of an inlier in presence of sparse corruptions.", "categories": "stat.ml cs.cv cs.lg", "doi": "10.1109/jstsp.2018.2876604", "created": "2017-02-06", "updated": "", "authors": ["mostafa rahmani", "george atia"], "affiliation": [], "url": "https://arxiv.org/abs/1702.01847"}
{"title": "tuple-oriented compression for large-scale mini-batch stochastic gradient descent", "id": "1702.06943", "abstract": "data compression is a popular technique for improving the efficiency of data processing workloads such as sql queries and more recently, machine learning (ml) with classical batch gradient methods. but the efficacy of such ideas for mini-batch stochastic gradient descent (mgd), arguably the workhorse algorithm of modern ml, is an open question. mgd's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. we fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (toc) that is inspired by an unlikely source, the string/text compression scheme lempel-ziv-welch, but tailored to mgd in a way that preserves tuple boundaries within mini-batches. we then present a suite of novel compressed matrix operation execution techniques tailored to the toc compression scheme that operate directly over the compressed data representation and avoid decompression overheads. an extensive empirical evaluation with real-world datasets shows that toc consistently achieves substantial compression ratios by up to 51x and reduces runtimes for mgd workloads by up to 10.2x in popular ml systems.", "categories": "cs.lg cs.db stat.ml", "doi": "10.1145/3299869.3300070", "created": "2017-02-22", "updated": "2019-01-20", "authors": ["fengan li", "lingjiao chen", "yijing zeng", "arun kumar", "jeffrey f. naughton", "jignesh m. patel", "xi wu"], "affiliation": [], "url": "https://arxiv.org/abs/1702.06943"}
{"title": "optimized cost per click in taobao display advertising", "id": "1703.02091", "abstract": "taobao, as the largest online retail platform in the world, provides billions of online display advertising impressions for millions of advertisers every day. for commercial purposes, the advertisers bid for specific spots and target crowds to compete for business traffic. the platform chooses the most suitable ads to display in tens of milliseconds. common pricing methods include cost per mille (cpm) and cost per click (cpc). traditional advertising systems target certain traits of users and ad placements with fixed bids, essentially regarded as coarse-grained matching of bid and traffic quality. however, the fixed bids set by the advertisers competing for different quality requests cannot fully optimize the advertisers' key requirements. moreover, the platform has to be responsible for the business revenue and user experience. thus, we proposed a bid optimizing strategy called optimized cost per click (ocpc) which automatically adjusts the bid to achieve finer matching of bid and traffic quality of page view (pv) request granularity. our approach optimizes advertisers' demands, platform business revenue and user experience and as a whole improves traffic allocation efficiency. we have validated our approach in taobao display advertising system in production. the online a/b test shows our algorithm yields substantially better results than previous fixed bid manner.", "categories": "cs.gt stat.ml", "doi": "10.1145/3097983.3098134", "created": "2017-02-27", "updated": "2019-01-29", "authors": ["han zhu", "junqi jin", "chang tan", "fei pan", "yifan zeng", "han li", "kun gai"], "affiliation": [], "url": "https://arxiv.org/abs/1703.02091"}
{"title": "multivariate gaussian and student$-t$ process regression for multi-output prediction", "id": "1703.04455", "abstract": "gaussian process model for vector-valued function has been shown to be useful for multi-output prediction. the existing method for this model is to re-formulate the matrix-variate gaussian distribution as a multivariate normal distribution. although it is effective in many cases, re-formulation is not always workable and is difficult to apply to other distributions because not all matrix-variate distributions can be transformed to respective multivariate distributions, such as the case for matrix-variate student$-t$ distribution. in this paper, we propose a unified framework which is used not only to introduce a novel multivariate student$-t$ process regression model (mv-tpr) for multi-output prediction, but also to reformulate the multivariate gaussian process regression (mv-gpr) that overcomes some limitations of the existing methods. both mv-gpr and mv-tpr have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework and thus can adopt the same optimization approaches as used in the conventional gpr. the usefulness of the proposed methods is illustrated through several simulated and real data examples. in particular, we verify empirically that mv-tpr has superiority for the datasets considered, including air quality prediction and bike rent prediction. at last, the proposed methods are shown to produce profitable investment strategies in the stock markets.", "categories": "stat.ml", "doi": "", "created": "2017-03-13", "updated": "2019-01-06", "authors": ["zexun chen", "bo wang", "alexander n. gorban"], "affiliation": [], "url": "https://arxiv.org/abs/1703.04455"}
{"title": "the multi-armed bandit problem: an efficient non-parametric solution", "id": "1703.08285", "abstract": "lai and robbins (1985) and lai (1987) provided efficient parametric solutions to the multi-armed bandit problem, showing that arm allocation via upper confidence bounds (ucb) achieves minimum regret. these bounds are constructed from the kullback-leibler information of the reward distributions, estimated from specified parametric families. in recent years there has been renewed interest in the multi-armed bandit problem due to new applications in machine learning algorithms and data analytics. non-parametric arm allocation procedures like $\\epsilon$-greedy, boltzmann exploration and besa were studied, and modified versions of the ucb procedure were also analyzed under non-parametric settings. however unlike ucb these non-parametric procedures are not efficient under general parametric settings. in this paper we propose efficient non-parametric procedures.", "categories": "math.st stat.th", "doi": "", "created": "2017-03-24", "updated": "2019-01-16", "authors": ["hock peng chan"], "affiliation": [], "url": "https://arxiv.org/abs/1703.08285"}
{"title": "binarsity: a penalization for one-hot encoded features in linear supervised learning", "id": "1703.08619", "abstract": "this paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. we propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called \\emph{binarsity}. in each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint. this induces two interesting properties on the model weights of the one-hot encoded features: they are piecewise constant, and are eventually block sparse. non-asymptotic oracle inequalities for generalized linear models are proposed. moreover, under a sparse additive model assumption, we prove that our procedure matches the state-of-the-art in this setting. numerical experiments illustrate the good performances of our approach on several datasets. it is also noteworthy that our method has a numerical complexity comparable to standard $\\ell_1$ penalization.", "categories": "stat.ml", "doi": "", "created": "2017-03-24", "updated": "2019-01-09", "authors": ["mokhtar z. alaya", "simon bussy", "stéphane gaïffas", "agathe guilloux"], "affiliation": [], "url": "https://arxiv.org/abs/1703.08619"}
{"title": "catalyst acceleration for gradient-based non-convex optimization", "id": "1703.10993", "abstract": "we introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. even though these methods may originally require convexity to operate, the proposed approach allows one to use them on weakly convex objectives, which covers a large class of non-convex functions typically appearing in machine learning and signal processing. in general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of nesterov and achieves near-optimal convergence rate in function values. these properties are achieved without assuming any knowledge about the convexity of the objective, by automatically adapting to the unknown weak convexity constant. we conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as svrg and saga for sparse matrix factorization and for learning neural networks.", "categories": "stat.ml math.oc", "doi": "", "created": "2017-03-31", "updated": "2018-12-31", "authors": ["courtney paquette", "hongzhou lin", "dmitriy drusvyatskiy", "julien mairal", "zaid harchaoui"], "affiliation": [], "url": "https://arxiv.org/abs/1703.10993"}
{"title": "a comparative study of counterfactual estimators", "id": "1704.00773", "abstract": "we provide a comparative study of several widely used off-policy estimators (empirical average, basic importance sampling and normalized importance sampling), detailing the different regimes where they are individually suboptimal. we then exhibit properties optimal estimators should possess. in the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved.", "categories": "stat.ml cs.lg", "doi": "", "created": "2017-04-03", "updated": "2019-01-29", "authors": ["thomas nedelec", "nicolas le roux", "vianney perchet"], "affiliation": [], "url": "https://arxiv.org/abs/1704.00773"}
{"title": "learning from comparisons and choices", "id": "1704.07228", "abstract": "when tracking user-specific online activities, each user's preference is revealed in the form of choices and comparisons. for example, a user's purchase history is a record of her choices, i.e. which item was chosen among a subset of offerings. a user's preferences can be observed either explicitly as in movie ratings or implicitly as in viewing times of news articles. given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. the learned features can be used to predict a user's preference of an unseen item to be used in recommendation systems. this also allows one to compute similarities among users and items to be used for categorization and search. motivated by the empirical successes of the multinomial logit (mnl) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the mnl model parameters that best explain the data. we propose a convex relaxation for learning the mnl model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. this characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. this provides a guideline for designing surveys when one can choose which items are to be compared. this is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions.", "categories": "stat.ml cs.lg", "doi": "", "created": "2017-04-24", "updated": "2018-12-30", "authors": ["sahand negahban", "sewoong oh", "kiran k. thekumparampil", "jiaming xu"], "affiliation": [], "url": "https://arxiv.org/abs/1704.07228"}
{"title": "improving drug sensitivity predictions in precision medicine through active expert knowledge elicitation", "id": "1705.03290", "abstract": "predicting the efficacy of a drug for a given individual, using high-dimensional genomic measurements, is at the core of precision medicine. however, identifying features on which to base the predictions remains a challenge, especially when the sample size is small. incorporating expert knowledge offers a promising alternative to improve a prediction model, but collecting such knowledge is laborious to the expert if the number of candidate features is very large. we introduce a probabilistic model that can incorporate expert feedback about the impact of genomic measurements on the sensitivity of a cancer cell for a given drug. we also present two methods to intelligently collect this feedback from the expert, using experimental design and multi-armed bandit models. in a multiple myeloma blood cancer data set (n=51), expert knowledge decreased the prediction error by 8%. furthermore, the intelligent approaches can be used to reduce the workload of feedback collection to less than 30% on average compared to a naive approach.", "categories": "cs.ai cs.hc cs.lg stat.ml", "doi": "10.1093/bioinformatics/bty257", "created": "2017-05-09", "updated": "", "authors": ["iiris sundin", "tomi peltola", "muntasir mamun majumder", "pedram daee", "marta soare", "homayun afrabandpey", "caroline heckman", "samuel kaski", "pekka marttinen"], "affiliation": [], "url": "https://arxiv.org/abs/1705.03290"}
{"title": "spatial variational auto-encoding via matrix-variate normal distributions", "id": "1705.06821", "abstract": "the key idea of variational auto-encoders (vaes) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. however, the latent variables in vaes are vectors, which can be interpreted as multiple feature maps of size 1x1. such representations can only convey spatial information implicitly when coupled with powerful decoders. in this work, we propose spatial vaes that use feature maps of larger size as latent variables to explicitly capture spatial information. this is achieved by allowing the latent variables to be sampled from matrix-variate normal (mvn) distributions whose parameters are computed from the encoder network. to increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial vaes via low-rank mvn distributions. experimental results show that the proposed spatial vaes outperform original vaes in capturing rich structural and spatial information.", "categories": "cs.lg cs.cv cs.ne stat.ml", "doi": "", "created": "2017-05-18", "updated": "2019-01-22", "authors": ["zhengyang wang", "hao yuan", "shuiwang ji"], "affiliation": [], "url": "https://arxiv.org/abs/1705.06821"}
{"title": "size matters: cardinality-constrained clustering and outlier detection via conic optimization", "id": "1705.07837", "abstract": "plain vanilla k-means clustering has proven to be successful in practice, yet it suffers from outlier sensitivity and may produce highly unbalanced clusters. to mitigate both shortcomings, we formulate a joint outlier detection and clustering problem, which assigns a prescribed number of datapoints to an auxiliary outlier cluster and performs cardinality-constrained k-means clustering on the residual dataset, treating the cluster cardinalities as a given input. we cast this problem as a mixed-integer linear program (milp) that admits tractable semidefinite and linear programming relaxations. we propose deterministic rounding schemes that transform the relaxed solutions to feasible solutions for the milp. we also prove that these solutions are optimal in the milp if a cluster separation condition holds.", "categories": "math.oc stat.ml", "doi": "", "created": "2017-05-22", "updated": "2019-01-10", "authors": ["napat rujeerapaiboon", "kilian schindler", "daniel kuhn", "wolfram wiesemann"], "affiliation": [], "url": "https://arxiv.org/abs/1705.07837"}
{"title": "the sup-norm perturbation of hosvd and low rank tensor denoising", "id": "1707.01207", "abstract": "the higher order singular value decomposition (hosvd) of tensors is a generalization of matrix svd. the perturbation analysis of hosvd under random noise is more delicate than its matrix counterpart. recently, polynomial time algorithms have been proposed where statistically optimal estimates of the singular subspaces and the low rank tensors are attainable in the euclidean norm. in this article, we analyze the sup-norm perturbation bounds of hosvd and introduce estimators of the singular subspaces with sharp deviation bounds in the sup-norm. we also investigate a low rank tensor denoising estimator and demonstrate its fast convergence rate with respect to the entry-wise errors. the sup-norm perturbation bounds reveal unconventional phase transitions for statistical learning applications such as the exact clustering in high dimensional gaussian mixture model and the exact support recovery in sub-tensor localizations. in addition, the bounds established for hosvd also elaborate the one-sided sup-norm perturbation bounds for the singular subspaces of unbalanced (or fat) matrices.", "categories": "math.st cs.it math.it math.pr stat.ml stat.th", "doi": "", "created": "2017-07-04", "updated": "2019-01-01", "authors": ["dong xia", "fan zhou"], "affiliation": [], "url": "https://arxiv.org/abs/1707.01207"}
{"title": "learning linear structural equation models in polynomial time and sample complexity", "id": "1707.04673", "abstract": "the problem of learning structural equation models (sems) from data is a fundamental problem in causal inference. we develop a new algorithm --- which is computationally and statistically efficient and works in the high-dimensional regime --- for learning linear sems from purely observational data with arbitrary noise distribution. we consider three aspects of the problem: identifiability, computational efficiency, and statistical efficiency. we show that when data is generated from a linear sem over $p$ nodes and maximum degree $d$, our algorithm recovers the directed acyclic graph (dag) structure of the sem under an identifiability condition that is more general than those considered in the literature, and without faithfulness assumptions. in the population setting, our algorithm recovers the dag structure in $\\mathcal{o}(p(d^2 + \\log p))$ operations. in the finite sample setting, if the estimated precision matrix is sparse, our algorithm has a smoothed complexity of $\\widetilde{\\mathcal{o}}(p^3 + pd^7)$, while if the estimated precision matrix is dense, our algorithm has a smoothed complexity of $\\widetilde{\\mathcal{o}}(p^5)$. for sub-gaussian noise, we show that our algorithm has a sample complexity of $\\mathcal{o}(\\frac{d^8}{\\varepsilon^2} \\log (\\frac{p}{\\sqrt{\\delta}}))$ to achieve $\\varepsilon$ element-wise additive error with respect to the true autoregression matrix with probability at most $1 - \\delta$, while for noise with bounded $(4m)$-th moment, with $m$ being a positive integer, our algorithm has a sample complexity of $\\mathcal{o}(\\frac{d^8}{\\varepsilon^2} (\\frac{p^2}{\\delta})^{1/m})$.", "categories": "cs.lg stat.ml", "doi": "", "created": "2017-07-14", "updated": "", "authors": ["asish ghoshal", "jean honorio"], "affiliation": [], "url": "https://arxiv.org/abs/1707.04673"}
{"title": "training-image based geostatistical inversion using a spatial generative adversarial neural network", "id": "1708.04975", "abstract": "probabilistic inversion within a multiple-point statistics framework is often computationally prohibitive for high-dimensional problems. to partly address this, we introduce and evaluate a new training-image based inversion approach for complex geologic media. our approach relies on a deep neural network of the generative adversarial network (gan) type. after training using a training image (ti), our proposed spatial gan (sgan) can quickly generate 2d and 3d unconditional realizations. a key characteristic of our sgan is that it defines a (very) low-dimensional parameterization, thereby allowing for efficient probabilistic inversion using state-of-the-art markov chain monte carlo (mcmc) methods. in addition, available direct conditioning data can be incorporated within the inversion. several 2d and 3d categorical tis are first used to analyze the performance of our sgan for unconditional geostatistical simulation. training our deep network can take several hours. after training, realizations containing a few millions of pixels/voxels can be produced in a matter of seconds. this makes it especially useful for simulating many thousands of realizations (e.g., for mcmc inversion) as the relative cost of the training per realization diminishes with the considered number of realizations. synthetic inversion case studies involving 2d steady-state flow and 3d transient hydraulic tomography with and without direct conditioning data are used to illustrate the effectiveness of our proposed sgan-based inversion. for the 2d case, the inversion rapidly explores the posterior model distribution. for the 3d case, the inversion recovers model realizations that fit the data close to the target level and visually resemble the true model well.", "categories": "stat.ml cs.cv physics.geo-ph", "doi": "10.1002/2017wr022148", "created": "2017-08-16", "updated": "2019-01-08", "authors": ["eric laloy", "romain hérault", "diederik jacques", "niklas linde"], "affiliation": [], "url": "https://arxiv.org/abs/1708.04975"}
{"title": "meta-learning mcmc proposals", "id": "1708.06040", "abstract": "effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. inspired by recent progresses in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable mcmc proposals. we parametrize the proposal as a neural network to provide fast approximations to block gibbs conditionals. the learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. we explore several applications including open-universe gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final f1 scores than classical single-site gibbs sampling.", "categories": "cs.ai cs.lg stat.ml", "doi": "", "created": "2017-08-20", "updated": "2019-01-01", "authors": ["tongzhou wang", "yi wu", "david a. moore", "stuart j. russell"], "affiliation": [], "url": "https://arxiv.org/abs/1708.06040"}
{"title": "statistical inference for data-adaptive doubly robust estimators with survival outcomes", "id": "1709.00401", "abstract": "the consistency of doubly robust estimators relies on consistent estimation of at least one of two nuisance regression parameters. in moderate to large dimensions, the use of flexible data-adaptive regression estimators may aid in achieving this consistency. however, $n^{1/2}$-consistency of doubly robust estimators is not guaranteed if one of the nuisance estimators is inconsistent. in this paper we present a doubly robust estimator for survival analysis with the novel property that it converges to a gaussian variable at $n^{1/2}$-rate for a large class of data-adaptive estimators of the nuisance parameters, under the only assumption that at least one of them is consistently estimated at a $n^{1/4}$-rate. this result is achieved through adaptation of recent ideas in semiparametric inference, which amount to: (i) gaussianizing (i.e., making asymptotically linear) a drift term that arises in the asymptotic analysis of the doubly robust estimator, and (ii) using cross-fitting to avoid entropy conditions on the nuisance estimators. we present the formula of the asymptotic variance of the estimator, which allows computation of doubly robust confidence intervals and p-values. we illustrate the finite-sample properties of the estimator in simulation studies, and demonstrate its use in a phase iii clinical trial for estimating the effect of a novel therapy for the treatment of her2 positive breast cancer.", "categories": "stat.ml", "doi": "", "created": "2017-09-01", "updated": "2019-01-29", "authors": ["iván díaz"], "affiliation": [], "url": "https://arxiv.org/abs/1709.00401"}
{"title": "interpretable graph-based semi-supervised learning via flows", "id": "1709.04764", "abstract": "in this paper, we consider the interpretability of the foundational laplacian-based semi-supervised learning approaches on graphs. we introduce a novel flow-based learning framework that subsumes the foundational approaches and additionally provides a detailed, transparent, and easily understood expression of the learning process in terms of graph flows. as a result, one can visualize and interactively explore the precise subgraph along which the information from labeled nodes flows to an unlabeled node of interest. surprisingly, the proposed framework avoids trading accuracy for interpretability, but in fact leads to improved prediction accuracy, which is supported both by theoretical considerations and empirical results. the flow-based framework guarantees the maximum principle by construction and can handle directed graphs in an out-of-the-box manner.", "categories": "stat.ml cs.lg", "doi": "", "created": "2017-09-14", "updated": "", "authors": ["raif m. rustamov", "james t. klosowski"], "affiliation": [], "url": "https://arxiv.org/abs/1709.04764"}
{"title": "deep reinforcement learning that matters", "id": "1709.06560", "abstract": "in recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (rl). reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. unfortunately, reproducing results for state-of-the-art deep rl methods is seldom straightforward. in particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. in this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. we illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep rl more reproducible. we aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.", "categories": "cs.lg stat.ml", "doi": "", "created": "2017-09-19", "updated": "2019-01-29", "authors": ["peter henderson", "riashat islam", "philip bachman", "joelle pineau", "doina precup", "david meger"], "affiliation": [], "url": "https://arxiv.org/abs/1709.06560"}
{"title": "enhanced quantum synchronization via quantum machine learning", "id": "1709.08519", "abstract": "we study the quantum synchronization between a pair of two-level systems inside two coupled cavities. by using a digital-analog decomposition of the master equation that rules the system dynamics, we show that this approach leads to quantum synchronization between both two-level systems. moreover, we can identify in this digital-analog block decomposition the fundamental elements of a quantum machine learning protocol, in which the agent and the environment (learning units) interact through a mediating system, namely, the register. if we can additionally equip this algorithm with a classical feedback mechanism, which consists of projective measurements in the register, reinitialization of the register state and local conditional operations on the agent and environment subspace, a powerful and flexible quantum machine learning protocol emerges. indeed, numerical simulations show that this protocol enhances the synchronization process, even when every subsystem experience different loss/decoherence mechanisms, and give us the flexibility to choose the synchronization state. finally, we propose an implementation based on current technologies in superconducting circuits.", "categories": "quant-ph cond-mat.mes-hall cs.ai cs.lg stat.ml", "doi": "10.1002/qute.201800076", "created": "2017-09-25", "updated": "2019-01-16", "authors": ["f. a. cárdenas-lópez", "m. sanz", "j. c. retamal", "e. solano"], "affiliation": [], "url": "https://arxiv.org/abs/1709.08519"}
{"title": "replication or exploration? sequential design for stochastic simulation experiments", "id": "1710.03206", "abstract": "we investigate the merits of replication, and provide methods for optimal design (including replicates), with the goal of obtaining globally accurate emulation of noisy computer simulation experiments. we first show that replication can be beneficial from both design and computational perspectives, in the context of gaussian process surrogate modeling. we then develop a lookahead based sequential design scheme that can determine if a new run should be at an existing input location (i.e., replicate) or at a new one (explore). when paired with a newly developed heteroskedastic gaussian process model, our dynamic design scheme facilitates learning of signal and noise relationships which can vary throughout the input space. we show that it does so efficiently, on both computational and statistical grounds. in addition to illustrative synthetic examples, we demonstrate performance on two challenging real-data simulation experiments, from inventory management and epidemiology.", "categories": "stat.me stat.co", "doi": "10.1080/00401706.2018.1469433", "created": "2017-10-09", "updated": "2019-01-25", "authors": ["mickael binois", "jiangeng huang", "robert b gramacy", "mike ludkovski"], "affiliation": [], "url": "https://arxiv.org/abs/1710.03206"}
{"title": "manifold regularization based on nystr{\\\"o}m type subsampling", "id": "1710.04872", "abstract": "in this paper, we study the nystr{\\\"o}m type subsampling for large scale kernel methods to reduce the computational complexities of big data. we discuss the multi-penalty regularization scheme based on nystr{\\\"o}m type subsampling which is motivated from well-studied manifold regularization schemes. we develop a theoretical analysis of multi-penalty least-square regularization scheme under the general source condition in vector-valued function setting, therefore the results can also be applied to multi-task learning problems. we achieve the optimal minimax convergence rates of multi-penalty regularization using the concept of effective dimension for the appropriate subsampling size. we discuss an aggregation approach based on linear function strategy to combine various nystr{\\\"o}m approximants. finally, we demonstrate the performance of multi-penalty regularization based on nystr{\\\"o}m type subsampling on caltech-101 data set for multi-class image classification and nsl-kdd benchmark data set for intrusion detection problem.", "categories": "stat.ml cs.lg", "doi": "10.1016/j.acha.2018.12.002", "created": "2017-10-13", "updated": "", "authors": ["abhishake rastogi", "sivananthan sampath"], "affiliation": [], "url": "https://arxiv.org/abs/1710.04872"}
{"title": "general bayesian inference over the stiefel manifold via the givens representation", "id": "1710.09443", "abstract": "we introduce an approach based on the givens representation that allows for a routine, reliable, and flexible way to infer bayesian models with orthogonal matrix parameters. this class of models most notably includes models from multivariate statistics such factor models and probabilistic principal component analysis (ppca). our approach overcomes several of the practical barriers to using the givens representation in a general bayesian inference framework. in particular, we show how to inexpensively compute the change-of-measure term necessary for transformations of random variables. we also show how to overcome specific topological pathologies that arise when representing circular random variables in an unconstrained space. in addition, we discuss how the alternative parameterization can be used to define new distributions over orthogonal matrices as well as to constrain parameter space to eliminate superfluous posterior modes in models such as ppca. while previous inference approaches to this problem involved specialized updates to the orthogonal matrix parameters, our approach lets us represent these constrained parameters in an unconstrained form. unlike previous approaches, this allows for the inference of models with orthogonal matrix parameters using any modern inference algorithm including those available in modern bayesian modeling frameworks such as stan, edward, or pymc3. we illustrate with examples how our approach can be used in practice in stan to infer models with orthogonal matrix parameters, and we compare to existing methods.", "categories": "stat.ml", "doi": "", "created": "2017-10-25", "updated": "2019-01-13", "authors": ["arya a pourzanjani", "richard m jiang", "brian mitchell", "paul j atzberger", "linda r petzold"], "affiliation": [], "url": "https://arxiv.org/abs/1710.09443"}
{"title": "generalized end-to-end loss for speaker verification", "id": "1710.10467", "abstract": "in this paper, we propose a new loss function called generalized end-to-end (ge2e) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (te2e) loss function. unlike te2e, the ge2e loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. additionally, the ge2e loss does not require an initial stage of example selection. with these properties, our model with the new loss function decreases speaker verification eer by more than 10%, while reducing the training time by 60% at the same time. we also introduce the multireader technique, which allows us to do domain adaptation - training a more accurate model that supports multiple keywords (i.e. \"ok google\" and \"hey google\") as well as multiple dialects.", "categories": "eess.as cs.cl cs.lg stat.ml", "doi": "", "created": "2017-10-28", "updated": "2019-01-24", "authors": ["li wan", "quan wang", "alan papir", "ignacio lopez moreno"], "affiliation": [], "url": "https://arxiv.org/abs/1710.10467"}
{"title": "convergence rates of latent topic models under relaxed identifiability conditions", "id": "1710.11070", "abstract": "in this paper we study the frequentist convergence rate for the latent dirichlet allocation (blei et al., 2003) topic models. we show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in wasserstein's distance metric at a rate of $n^{-1/4}$ without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of anandkumar et al. (2012, 2014) from an information-theoretical perspective. we also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.", "categories": "stat.ml cs.lg", "doi": "", "created": "2017-10-30", "updated": "2018-03-17", "authors": ["yining wang"], "affiliation": [], "url": "https://arxiv.org/abs/1710.11070"}
{"title": "moonshine: distilling with cheap convolutions", "id": "1711.02613", "abstract": "many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. we propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. using attention transfer, we provide pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. we show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.", "categories": "stat.ml cs.cv cs.lg", "doi": "", "created": "2017-11-07", "updated": "2019-01-17", "authors": ["elliot j. crowley", "gavin gray", "amos storkey"], "affiliation": [], "url": "https://arxiv.org/abs/1711.02613"}
{"title": "crafting adversarial examples for speech paralinguistics applications", "id": "1711.03280", "abstract": "computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics. while state-of-the-art machine learning techniques, such as deep neural networks, can provide robust and accurate speech analysis, they are susceptible to adversarial attacks. in this work, we propose an end-to-end scheme to generate adversarial examples for computational paralinguistic applications by perturbing directly the raw waveform of an audio recording rather than specific acoustic features. our experiments show that the proposed adversarial perturbation can lead to a significant performance drop of state-of-the-art deep neural networks, while only minimally impairing the audio quality.", "categories": "cs.lg cs.cr cs.sd eess.as stat.ml", "doi": "10.1145/3306195.3306196", "created": "2017-11-09", "updated": "2019-01-11", "authors": ["yuan gong", "christian poellabauer"], "affiliation": [], "url": "https://arxiv.org/abs/1711.03280"}
{"title": "how wrong am i? - studying adversarial examples and their impact on uncertainty in gaussian process machine learning models", "id": "1711.06598", "abstract": "machine learning models are vulnerable to adversarial examples: minor perturbations to input samples intended to deliberately cause misclassification. current defenses against adversarial examples, especially for deep neural networks (dnn), are primarily derived from empirical developments, and their security guarantees are often only justified retroactively. many defenses therefore rely on hidden assumptions that are subsequently subverted by increasingly elaborate attacks. this is not surprising: deep learning notoriously lacks a comprehensive mathematical framework to provide meaningful guarantees. in this paper, we leverage gaussian processes to investigate adversarial examples in the framework of bayesian inference. across different models and datasets, we find deviating levels of uncertainty reflect the perturbation introduced to benign samples by state-of-the-art attacks, including novel white-box attacks on gaussian processes. our experiments demonstrate that even unoptimized uncertainty thresholds already reject adversarial examples in many scenarios. comment: thresholds can be broken in a modified attack, which was done in arxiv:1812.02606 (the limitations of model uncertainty in adversarial settings).", "categories": "cs.cr cs.lg stat.ml", "doi": "", "created": "2017-11-17", "updated": "2019-01-03", "authors": ["kathrin grosse", "david pfaff", "michael thomas smith", "michael backes"], "affiliation": [], "url": "https://arxiv.org/abs/1711.06598"}
{"title": "an improved oscillating-error classifier with branching", "id": "1711.07042", "abstract": "this paper extends the earlier work on an oscillating error correction technique. specifically, it extends the design to include further corrections, by adding new layers to the classifier through a branching method. this technique is still consistent with earlier work and also neural networks in general. with this extended design, the classifier can now achieve the high levels of accuracy reported previously.", "categories": "cs.lg stat.ml", "doi": "", "created": "2017-11-19", "updated": "2019-01-18", "authors": ["kieran greer"], "affiliation": [], "url": "https://arxiv.org/abs/1711.07042"}
{"title": "does mitigating ml's impact disparity require treatment disparity?", "id": "1711.07076", "abstract": "following related work in law and policy, two notions of disparity have come to shape the study of fairness in algorithmic decision-making. algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently; algorithms exhibit impact disparity when outcomes differ across subgroups, even if the correlation arises unintentionally. naturally, we can achieve impact parity through purposeful treatment disparity. in one thread of technical work, papers aim to reconcile the two forms of parity proposing disparate learning processes (dlps). here, the learning algorithm can see group membership during training but produce a classifier that is group-blind at test time. in this paper, we show theoretically that: (i) when other features correlate to group membership, dlps will (indirectly) implement treatment disparity, undermining the policy desiderata they are designed to address; (ii) when group membership is partly revealed by other features, dlps induce within-class discrimination; and (iii) in general, dlps provide a suboptimal trade-off between accuracy and impact parity. based on our technical analysis, we argue that transparent treatment disparity is preferable to occluded methods for achieving impact parity. experimental results on several real-world datasets highlight the practical consequences of applying dlps vs. per-group thresholds.", "categories": "stat.ml cs.lg", "doi": "", "created": "2017-11-19", "updated": "2019-01-11", "authors": ["zachary c. lipton", "alexandra chouldechova", "julian mcauley"], "affiliation": [], "url": "https://arxiv.org/abs/1711.07076"}
{"title": "dependent relevance determination for smooth and structured sparse regression", "id": "1711.10058", "abstract": "in many problem settings, parameter vectors are not merely sparse but dependent in such a way that non-zero coefficients tend to cluster together. we refer to this form of dependency as \"region sparsity.\" classical sparse regression methods, such as the lasso and automatic relevance determination (ard), which model parameters as independent a priori, and therefore do not exploit such dependencies. here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed gaussian process to model the dependencies between the prior variances of regression weights. we combine this with a structured model of the prior variances of fourier coefficients, which eliminates unnecessary high frequencies. the resulting prior encourages weights to be region-sparse in two different bases simultaneously. we develop laplace approximation and monte carlo markov chain (mcmc) sampling to provide efficient inference for the posterior. furthermore, a two-stage convex relaxation of the laplace approximation approach is also provided to relax the inevitable non-convexity during the optimization. we finally show substantial improvements over comparable methods for both simulated and real datasets from brain imaging.", "categories": "stat.ml", "doi": "", "created": "2017-11-27", "updated": "2019-01-24", "authors": ["anqi wu", "oluwasanmi koyejo", "jonathan w. pillow"], "affiliation": [], "url": "https://arxiv.org/abs/1711.10058"}
{"title": "estimation and optimization of composite outcomes", "id": "1711.10581", "abstract": "there is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. an individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. a treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. however, clinical and intervention scientists often must balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. one approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. we propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. the estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. we derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.", "categories": "stat.ml", "doi": "", "created": "2017-11-28", "updated": "2019-01-23", "authors": ["daniel j. luckett", "eric b. laber", "michael r. kosorok"], "affiliation": [], "url": "https://arxiv.org/abs/1711.10581"}
{"title": "thermostat-assisted continuously-tempered hamiltonian monte carlo for bayesian learning", "id": "1711.11511", "abstract": "we propose a new sampling method, the thermostat-assisted continuously-tempered hamiltonian monte carlo, for bayesian learning on large datasets and multimodal distributions. it simulates the nos\\'e-hoover dynamics of a continuously-tempered hamiltonian system built on the distribution of interest. a significant advantage of this method is that it is not only able to efficiently draw representative i.i.d. samples when the distribution contains multiple isolated modes, but capable of adaptively neutralising the noise arising from mini-batches and maintaining accurate sampling. while the properties of this method have been studied using synthetic distributions, experiments on three real datasets also demonstrated the gain of performance over several strong baselines with various types of neural networks plunged in.", "categories": "stat.ml", "doi": "", "created": "2017-11-30", "updated": "2019-01-28", "authors": ["rui luo", "jianhong wang", "yaodong yang", "zhanxing zhu", "jun wang"], "affiliation": [], "url": "https://arxiv.org/abs/1711.11511"}
{"title": "linearly-recurrent autoencoder networks for learning dynamics", "id": "1712.01378", "abstract": "this paper describes a method for learning low-dimensional approximations of nonlinear dynamical systems, based on neural-network approximations of the underlying koopman operator. extended dynamic mode decomposition (edmd) provides a useful data-driven approximation of the koopman operator for analyzing dynamical systems. this paper addresses a fundamental problem associated with edmd: a trade-off between representational capacity of the dictionary and over-fitting due to insufficient data. a new neural network architecture combining an autoencoder with linear recurrent dynamics in the encoded state is used to learn a low-dimensional and highly informative koopman-invariant subspace of observables. a method is also presented for balanced model reduction of over-specified edmd systems in feature space. nonlinear reconstruction using partially linear multi-kernel regression aims to improve reconstruction accuracy from the low-dimensional state when the data has complex but intrinsically low-dimensional structure. the techniques demonstrate the ability to identify koopman eigenfunctions of the unforced duffing equation, create accurate low-dimensional models of an unstable cylinder wake flow, and make short-time predictions of the chaotic kuramoto-sivashinsky equation.", "categories": "math.ds cs.lg stat.ml", "doi": "", "created": "2017-12-04", "updated": "2019-01-15", "authors": ["samuel e. otto", "clarence w. rowley"], "affiliation": [], "url": "https://arxiv.org/abs/1712.01378"}
{"title": "manifold-valued image generation with wasserstein generative adversarial nets", "id": "1712.01551", "abstract": "generative modeling over natural images is one of the most fundamental machine learning problems. however, few modern generative models, including wasserstein generative adversarial nets (wgans), are studied on manifold-valued images that are frequently encountered in real-world applications. to fill the gap, this paper first formulates the problem of generating manifold-valued images and exploits three typical instances: hue-saturation-value (hsv) color image generation, chromaticity-brightness (cb) color image generation, and diffusion-tensor (dt) image generation. for the proposed generative modeling problem, we then introduce a theorem of optimal transport to derive a new wasserstein distance of data distributions on complete manifolds, enabling us to achieve a tractable objective under the wgan framework. in addition, we recommend three benchmark datasets that are cifar-10 hsv/cb color images, imagenet hsv/cb color images, ucl dt image datasets. on the three datasets, we experimentally demonstrate the proposed manifold-aware wgan model can generate more plausible manifold-valued images than its competitors.", "categories": "cs.cv stat.ml", "doi": "", "created": "2017-12-05", "updated": "2019-01-03", "authors": ["zhiwu huang", "jiqing wu", "luc van gool"], "affiliation": [], "url": "https://arxiv.org/abs/1712.01551"}
{"title": "parallel markov chain monte carlo for bayesian hierarchical models with big data, in two stages", "id": "1712.05907", "abstract": "due to the escalating growth of big data sets in recent years, new bayesian markov chain monte carlo (mcmc) parallel computing methods have been developed. these methods partition large data sets by observations into subsets. however, for bayesian nested hierarchical models, typically only a few parameters are common for the full data set, with most parameters being group-specific. thus, parallel bayesian mcmc methods that take into account the structure of the model and split the full data set by groups rather than by observations are a more natural approach for analysis. here, we adapt and extend a recently introduced two-stage bayesian hierarchical modeling approach, and we partition complete data sets by groups. in stage 1, the group-specific parameters are estimated independently in parallel. the stage 1 posteriors are used as proposal distributions in stage 2, where the target distribution is the full model. using three-level and four-level models, we show in both simulation and real data studies that results of our method agree closely with the full data analysis, with greatly increased mcmc efficiency and greatly reduced computation times. the advantages of our method versus existing parallel mcmc computing methods are also described.", "categories": "stat.me cs.dc stat.co stat.ml", "doi": "", "created": "2017-12-16", "updated": "2019-01-16", "authors": ["zheng wei", "erin m. conlon"], "affiliation": [], "url": "https://arxiv.org/abs/1712.05907"}
{"title": "panoramic robust pca for foreground-background separation on noisy, free-motion camera video", "id": "1712.06229", "abstract": "this work presents a new robust pca method for foreground-background separation on freely moving camera video with possible dense and sparse corruptions. our proposed method registers the frames of the corrupted video and then encodes the varying perspective arising from camera motion as missing data in a global model. this formulation allows our algorithm to produce a panoramic background component that automatically stitches together corrupted data from partially overlapping frames to reconstruct the full field of view. we model the registered video as the sum of a low-rank component that captures the background, a smooth component that captures the dynamic foreground of the scene, and a sparse component that isolates possible outliers and other sparse corruptions in the video. the low-rank portion of our model is based on a recent low-rank matrix estimator (optshrink) that has been shown to yield superior low-rank subspace estimates in practice. to estimate the smooth foreground component of our model, we use a weighted total variation framework that enables our method to reliably decouple the true foreground of the video from sparse corruptions. we perform extensive numerical experiments on both static and moving camera video subject to a variety of dense and sparse corruptions. our experiments demonstrate the state-of-the-art performance of our proposed method compared to existing methods both in terms of foreground and background estimation accuracy.", "categories": "stat.ml cs.cv", "doi": "", "created": "2017-12-17", "updated": "2019-01-03", "authors": ["brian e. moore", "chen gao", "raj rao nadakuditi"], "affiliation": [], "url": "https://arxiv.org/abs/1712.06229"}
{"title": "bayesian nonparametric causal inference: information rates and learning algorithms", "id": "1712.08914", "abstract": "we investigate the problem of estimating the causal effect of a treatment on individual subjects from observational data, this is a central problem in various application domains, including healthcare, social sciences, and online advertising. within the neyman rubin potential outcomes model, we use the kullback leibler (kl) divergence between the estimated and true distributions as a measure of accuracy of the estimate, and we define the information rate of the bayesian causal inference procedure as the (asymptotic equivalence class of the) expected value of the kl divergence between the estimated and true distributions as a function of the number of samples. using fano method, we establish a fundamental limit on the information rate that can be achieved by any bayesian estimator, and show that this fundamental limit is independent of the selection bias in the observational data. we characterize the bayesian priors on the potential (factual and counterfactual) outcomes that achieve the optimal information rate. as a consequence, we show that a particular class of priors that have been widely used in the causal inference literature cannot achieve the optimal information rate. on the other hand, a broader class of priors can achieve the optimal information rate. we go on to propose a prior adaptation procedure (which we call the information based empirical bayes procedure) that optimizes the bayesian prior by maximizing an information theoretic criterion on the recovered causal effects rather than maximizing the marginal likelihood of the observed (factual) data. building on our analysis, we construct an information optimal bayesian causal inference algorithm.", "categories": "stat.me cs.lg stat.ml", "doi": "10.1109/jstsp.2018.2848230", "created": "2017-12-24", "updated": "2018-01-21", "authors": ["ahmed m. alaa", "mihaela van der schaar"], "affiliation": [], "url": "https://arxiv.org/abs/1712.08914"}
{"title": "deep learning for electromyographic hand gesture signal classification using transfer learning", "id": "1801.07756", "abstract": "in recent years, deep learning algorithms have become increasingly more prominent for their unparalleled ability to automatically learn discriminant features from large amounts of data. however, within the field of electromyography-based gesture recognition, deep learning algorithms are seldom employed as they require an unreasonable amount of effort from a single person, to generate tens of thousands of examples. this work's hypothesis is that general, informative features can be learned from the large amounts of data generated by aggregating the signals of multiple users, thus reducing the recording burden while enhancing gesture recognition. consequently, this paper proposes applying transfer learning on aggregated data from multiple users, while leveraging the capacity of deep learning algorithms to learn discriminant features from large datasets. two datasets comprised of 19 and 17 able-bodied participants respectively (the first one is employed for pre-training) were recorded for this work, using the myo armband. a third myo armband dataset was taken from the ninapro database and is comprised of 10 able-bodied participants. three different deep learning networks employing three different modalities as input (raw emg, spectrograms and continuous wavelet transform (cwt)) are tested on the second and third dataset. the proposed transfer learning scheme is shown to systematically and significantly enhance the performance for all three networks on the two datasets, achieving an offline accuracy of 98.31% for 7 gestures over 17 participants for the cwt-based convnet and 68.98% for 18 gestures over 10 participants for the raw emg-based convnet. finally, a use-case study employing eight able-bodied participants suggests that real-time feedback allows users to adapt their muscle activation strategy which reduces the degradation in accuracy normally experienced over time.", "categories": "cs.lg stat.ml", "doi": "", "created": "2018-01-10", "updated": "2019-01-25", "authors": ["ulysse côté-allard", "cheikh latyr fall", "alexandre drouin", "alexandre campeau-lecours", "clément gosselin", "kyrre glette", "françois laviolette", "benoit gosselin"], "affiliation": [], "url": "https://arxiv.org/abs/1801.07756"}
{"title": "correlated components analysis - extracting reliable dimensions in multivariate data", "id": "1801.08881", "abstract": "how does one find dimensions in multivariate data that are reliably expressed across repetitions? for example, in a brain imaging study one may want to identify combinations of neural signals that are reliably expressed across multiple trials or subjects. for a behavioral assessment with multiple ratings, one may want to identify an aggregate score that is reliably reproduced across raters. correlated components analysis (corrca) addresses this problem by identifying components that are maximally correlated between repetitions (e.g. trials, subjects, raters). here we formalize this as the maximization of the ratio of between-repetition to within-repetition covariance. we show that this criterion maximizes repeat-reliability, defined as mean over variance across repeats, and that it leads to corrca or to multi-set canonical correlation analysis, depending on the constraints. surprisingly, we also find that corrca is equivalent to linear discriminant analysis for zero-mean signals, which provides an unexpected link between classic concepts of multivariate analysis. we present an exact parametric test of statistical significance based on the f-statistic for normally distributed independent samples, and present and validate shuffle statistics for the case of dependent samples. regularization and extension to non-linear mappings using kernels are also presented. the algorithms are demonstrated on a series of data analysis applications, and we provide all code and data required to reproduce the results.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-01-26", "updated": "2019-01-20", "authors": ["lucas c. parra", "stefan haufe", "jacek p. dmochowski"], "affiliation": [], "url": "https://arxiv.org/abs/1801.08881"}
{"title": "algorithmic linearly constrained gaussian processes", "id": "1801.09197", "abstract": "we algorithmically construct multi-output gaussian process priors which satisfy linear differential equations. our approach attempts to parametrize all solutions of the equations using gr\\\"obner bases. if successful, a push forward gaussian process along the paramerization is the desired prior. we consider several examples from physics, geomathematics and control, among them the full inhomogeneous system of maxwell's equations. by bringing together stochastic learning and computer algebra in a novel way, we combine noisy observations with precise algebraic computations.", "categories": "stat.ml cs.lg cs.sc math.ac", "doi": "", "created": "2018-01-28", "updated": "2019-01-04", "authors": ["markus lange-hegermann"], "affiliation": [], "url": "https://arxiv.org/abs/1801.09197"}
{"title": "matrix completion with deterministic pattern - a geometric perspective", "id": "1802.00047", "abstract": "we consider the matrix completion problem with a deterministic pattern of observed entries. in this setting, we aim to answer the question: under what condition there will be (at least locally) unique solution to the matrix completion problem, i.e., the underlying true matrix is identifiable. we answer the question from a certain point of view and outline a geometric perspective. we give an algebraically verifiable sufficient condition, which we call the well-posedness condition, for the local uniqueness of mrmc solutions. we argue that this condition is necessary for local stability of mrmc solutions, and we show that the condition is generic using the characteristic rank. we also argue that the low-rank approximation approaches are more stable than mrmc and further propose a sequential statistical testing procedure to determine the \"true\" rank from observed entries. finally, we provide numerical examples aimed at verifying validity of the presented theory.", "categories": "cs.lg math.st stat.ml stat.th", "doi": "10.1109/tsp.2018.2885494", "created": "2018-01-31", "updated": "2018-08-29", "authors": ["alexander shapiro", "yao xie", "rui zhang"], "affiliation": [], "url": "https://arxiv.org/abs/1802.00047"}
{"title": "predicting university students' academic success and major using random forests", "id": "1802.03418", "abstract": "in this article, a large data set containing every course taken by every undergraduate student in a major university in canada over 10 years is analysed. modern machine learning algorithms can use large data sets to build useful tools for the data provider, in this case, the university. in this article, two classifiers are constructed using random forests. to begin, the first two semesters of courses completed by a student are used to predict if they will obtain an undergraduate degree. secondly, for the students that completed a program, their major is predicted using once again the first few courses they have registered to. a classification tree is an intuitive and powerful classifier and building a random forest of trees improves this classifier. random forests also allow for reliable variable importance measurements. these measures explain what variables are useful to the classifiers and can be used to better understand what is statistically related to the students' situation. the results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.", "categories": "stat.ml cs.lg", "doi": "10.1007/s11162-019-09546-y", "created": "2018-02-09", "updated": "2019-01-12", "authors": ["cédric beaulac", "jeffrey s. rosenthal"], "affiliation": [], "url": "https://arxiv.org/abs/1802.03418"}
{"title": "nearly optimal adaptive procedure with change detection for piecewise-stationary bandit", "id": "1802.03692", "abstract": "multi-armed bandit (mab) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. we consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. we show that by incorporating a simple change-detection component with classic ucb algorithms to detect and adapt to changes, our so-called m-ucb algorithm can achieve nearly optimal regret bound on the order of $o(\\sqrt{mkt\\log t})$, where $t$ is the number of time steps, $k$ is the number of arms, and $m$ is the number of stationary segments. comparison with the best available lower bound shows that our m-ucb is nearly optimal in $t$ up to a logarithmic factor. we also compare m-ucb with the state-of-the-art algorithms in numerical experiments using a public yahoo! dataset to demonstrate its superior performance.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-02-10", "updated": "2019-01-24", "authors": ["yang cao", "zheng wen", "branislav kveton", "yao xie"], "affiliation": [], "url": "https://arxiv.org/abs/1802.03692"}
{"title": "design of experiments for model discrimination hybridising analytical and data-driven approaches", "id": "1802.04170", "abstract": "healthcare companies must submit pharmaceutical drugs or medical devices to regulatory bodies before marketing new technology. regulatory bodies frequently require transparent and interpretable computational modelling to justify a new healthcare technology, but researchers may have several competing models for a biological system and too little data to discriminate between the models. in design of experiments for model discrimination, the goal is to design maximally informative physical experiments in order to discriminate between rival predictive models. prior work has focused either on analytical approaches, which cannot manage all functions, or on data-driven approaches, which may have computational difficulties or lack interpretable marginal predictive distributions. we develop a methodology introducing gaussian process surrogates in lieu of the original mechanistic models. we thereby extend existing design and model discrimination methods developed for analytical models to cases of non-analytical models in a computationally efficient manner.", "categories": "stat.ap stat.ml", "doi": "", "created": "2018-02-12", "updated": "2018-05-31", "authors": ["simon olofsson", "marc peter deisenroth", "ruth misener"], "affiliation": [], "url": "https://arxiv.org/abs/1802.04170"}
{"title": "adversarially regularized graph autoencoder for graph embedding", "id": "1802.04407", "abstract": "graph embedding is an effective method to represent graph data in a low dimensional space for graph analytics. most existing embedding algorithms typically focus on preserving the topological structure or minimizing the reconstruction errors of graph data, but they have mostly ignored the data distribution of the latent codes from the graphs, which often results in inferior embedding in real-world graph data. in this paper, we propose a novel adversarial graph embedding framework for graph data. the framework encodes the topological structure and node content in a graph to a compact representation, on which a decoder is trained to reconstruct the graph structure. furthermore, the latent representation is enforced to match a prior distribution via an adversarial training scheme. to learn a robust embedding, two variants of adversarial approaches, adversarially regularized graph autoencoder (arga) and adversarially regularized variational graph autoencoder (arvga), are developed. experimental studies on real-world graphs validate our design and demonstrate that our algorithms outperform baselines by a wide margin in link prediction, graph clustering, and graph visualization tasks.", "categories": "cs.lg stat.ml", "doi": "", "created": "2018-02-12", "updated": "2019-01-07", "authors": ["shirui pan", "ruiqi hu", "guodong long", "jing jiang", "lina yao", "chengqi zhang"], "affiliation": [], "url": "https://arxiv.org/abs/1802.04407"}
{"title": "gilbo: one metric to measure them all", "id": "1802.04874", "abstract": "we propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the gilbo (generative information lower bound). it offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. it is well-defined for both vaes and gans. we compute the gilbo for 800 gans and vaes each trained on four datasets (mnist, fashionmnist, cifar-10 and celeba) and discuss the results.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-02-13", "updated": "2019-01-10", "authors": ["alexander a. alemi", "ian fischer"], "affiliation": [], "url": "https://arxiv.org/abs/1802.04874"}
{"title": "quantum variational autoencoder", "id": "1802.05779", "abstract": "variational autoencoders (vaes) are powerful generative models with the salient ability to perform inference. here, we introduce a quantum variational autoencoder (qvae): a vae whose latent generative process is implemented as a quantum boltzmann machine (qbm). we show that our model can be trained end-to-end by maximizing a well-defined loss-function: a 'quantum' lower-bound to a variational approximation of the log-likelihood. we use quantum monte carlo (qmc) simulations to train and evaluate the performance of qvaes. to achieve the best performance, we first create a vae platform with discrete latent space generated by a restricted boltzmann machine (rbm). our model achieves state-of-the-art performance on the mnist dataset when compared against similar approaches that only involve discrete variables in the generative process. we consider qvaes with a smaller number of latent units to be able to perform qmc simulations, which are computationally expensive. we show that qvaes can be trained effectively in regimes where quantum effects are relevant despite training via the quantum bound. our findings open the way to the use of quantum computers to train qvaes to achieve competitive performance for generative models. placing a qbm in the latent space of a vae leverages the full potential of current and next-generation quantum computers as sampling devices.", "categories": "quant-ph cs.lg stat.ml", "doi": "10.1088/2058-9565/aada1f", "created": "2018-02-15", "updated": "2019-01-12", "authors": ["amir khoshaman", "walter vinci", "brandon denis", "evgeny andriyash", "hossein sadeghi", "mohammad h. amin"], "affiliation": [], "url": "https://arxiv.org/abs/1802.05779"}
{"title": "anomaly detection using one-class neural networks", "id": "1802.06360", "abstract": "we propose a one-class neural network (oc-nn) model to detect anomalies in complex data sets. oc-nn combines the ability of deep networks to extract a progressively rich representation of data with the one-class objective of creating a tight envelope around normal data. the oc-nn approach breaks new ground for the following crucial reason: data representation in the hidden layer is driven by the oc-nn objective and is thus customized for anomaly detection. this is a departure from other approaches which use a hybrid approach of learning deep features using an autoencoder and then feeding the features into a separate anomaly detection method like one-class svm (oc-svm). the hybrid oc-svm approach is sub-optimal because it is unable to influence representational learning in the hidden layers. a comprehensive set of experiments demonstrate that on complex data sets (like cifar and gtsrb), oc-nn performs on par with state-of-the-art methods and outperformed conventional shallow methods in some scenarios.", "categories": "cs.lg cs.ne stat.ml", "doi": "", "created": "2018-02-18", "updated": "2019-01-10", "authors": ["raghavendra chalapathy", "aditya krishna menon", "sanjay chawla"], "affiliation": ["university of sydney and capital markets cooperative research centre", "data61/csiro and the australian national university", "qatar computing research institute"], "url": "https://arxiv.org/abs/1802.06360"}
{"title": "guaranteed recovery of one-hidden-layer neural networks via cross entropy", "id": "1802.06463", "abstract": "we study model recovery for data classification, where the training labels are generated from a one-hidden-layer neural network with sigmoid activations, and the goal is to recover the weights of the neural network. we consider two network models, the fully-connected network (fcn) and the non-overlapping convolutional neural network (cnn). we prove that with gaussian inputs, the empirical risk based on cross entropy exhibits strong convexity and smoothness {\\em uniformly} in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large. this implies that if initialized in this neighborhood, gradient descent converges linearly to a critical point that is provably close to the ground truth. furthermore, we show such an initialization can be obtained via the tensor method. this establishes the global convergence guarantee for empirical risk minimization using cross entropy via gradient descent for learning one-hidden-layer neural networks, at the near-optimal sample and computational complexity with respect to the network input dimension without unrealistic assumptions such as requiring a fresh set of samples at each iteration.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-02-18", "updated": "2019-01-19", "authors": ["haoyu fu", "yuejie chi", "yingbin liang"], "affiliation": [], "url": "https://arxiv.org/abs/1802.06463"}
{"title": "computation of optimal transport and related hedging problems via penalization and neural networks", "id": "1802.08539", "abstract": "this paper presents a widely applicable approach to solving (multi-marginal, martingale) optimal transport and related problems via neural networks. the core idea is to penalize the optimization problem in its dual formulation and reduce it to a finite dimensional one which corresponds to optimizing a neural network with smooth objective function. we present numerical examples from optimal transport, martingale optimal transport, portfolio optimization under uncertainty and generative adversarial networks that showcase the generality and effectiveness of the approach.", "categories": "math.oc q-fin.mf stat.ml", "doi": "", "created": "2018-02-23", "updated": "2019-01-25", "authors": ["stephan eckstein", "michael kupper"], "affiliation": [], "url": "https://arxiv.org/abs/1802.08539"}
{"title": "on oracle-efficient pac rl with rich observations", "id": "1803.00606", "abstract": "we study the computational tractability of pac reinforcement learning with rich observations. we present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. these methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. with stochastic hidden state dynamics, we prove that the only known sample-efficient algorithm, olive, cannot be implemented in the oracle model. we also present several examples that illustrate fundamental challenges of tractable pac reinforcement learning in such general settings.", "categories": "cs.lg stat.ml", "doi": "", "created": "2018-03-01", "updated": "2019-01-16", "authors": ["christoph dann", "nan jiang", "akshay krishnamurthy", "alekh agarwal", "john langford", "robert e. schapire"], "affiliation": [], "url": "https://arxiv.org/abs/1803.00606"}
{"title": "gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification", "id": "1803.01229", "abstract": "deep learning methods, and in particular convolutional neural networks (cnns), have led to an enormous breakthrough in a wide range of computer vision tasks, primarily by using large-scale annotated datasets. however, obtaining such datasets in the medical domain remains a challenge. in this paper, we present methods for generating synthetic medical images using recently presented deep learning generative adversarial networks (gans). furthermore, we show that generated medical images can be used for synthetic data augmentation, and improve the performance of cnn for medical image classification. our novel method is demonstrated on a limited dataset of computed tomography (ct) images of 182 liver lesions (53 cysts, 64 metastases and 65 hemangiomas). we first exploit gan architectures for synthesizing high quality liver lesion rois. then we present a novel scheme for liver lesion classification using cnn. finally, we train the cnn using classic data augmentation and our synthetic data augmentation and compare performance. in addition, we explore the quality of our synthesized examples using visualization and expert assessment. the classification performance using only classic data augmentation yielded 78.6% sensitivity and 88.4% specificity. by adding the synthetic data augmentation the results increased to 85.7% sensitivity and 92.4% specificity. we believe that this approach to synthetic data augmentation can generalize to other medical classification applications and thus support radiologists' efforts to improve diagnosis.", "categories": "cs.cv cs.lg stat.ml", "doi": "10.1016/j.neucom.2018.09.013", "created": "2018-03-03", "updated": "", "authors": ["maayan frid-adar", "idit diamant", "eyal klang", "michal amitai", "jacob goldberger", "hayit greenspan"], "affiliation": [], "url": "https://arxiv.org/abs/1803.01229"}
{"title": "learning filter bank sparsifying transforms", "id": "1803.01980", "abstract": "data is said to follow the transform (or analysis) sparsity model if it becomes sparse when acted on by a linear operator called a sparsifying transform. several algorithms have been designed to learn such a transform directly from data, and data-adaptive sparsifying transforms have demonstrated excellent performance in signal restoration tasks. sparsifying transforms are typically learned using small sub-regions of data called patches, but these algorithms often ignore redundant information shared between neighboring patches. we show that many existing transform and analysis sparse representations can be viewed as filter banks, thus linking the local properties of patch-based model to the global properties of a convolutional model. we propose a new transform learning framework where the sparsifying transform is an undecimated perfect reconstruction filter bank. unlike previous transform learning algorithms, the filter length can be chosen independently of the number of filter bank channels. numerical results indicate filter bank sparsifying transforms outperform existing patch-based transform learning for image denoising while benefiting from additional flexibility in the design process.", "categories": "stat.ml cs.lg eess.sp", "doi": "10.1109/tsp.2018.2883021", "created": "2018-03-05", "updated": "", "authors": ["luke pfister", "yoram bresler"], "affiliation": [], "url": "https://arxiv.org/abs/1803.01980"}
{"title": "transfer learning with neural automl", "id": "1803.02780", "abstract": "we reduce the computational cost of neural automl with transfer learning. automl relieves human effort by automating the design of ml algorithms. neural automl has become popular for the design of deep learning architectures, however, this method has a high computation cost. to address this we propose transfer neural automl that uses knowledge from prior tasks to speed up network design. we extend rl-based architecture search methods to support parallel training on multiple tasks and then transfer the search strategy to new tasks. on language and image classification tasks, transfer neural automl reduces convergence time over single-task training by over an order of magnitude on many tasks.", "categories": "cs.lg stat.ml", "doi": "", "created": "2018-03-07", "updated": "2019-01-28", "authors": ["catherine wong", "neil houlsby", "yifeng lu", "andrea gesmundo"], "affiliation": [], "url": "https://arxiv.org/abs/1803.02780"}
{"title": "phasenet: a deep-neural-network-based seismic arrival time picking method", "id": "1803.03211", "abstract": "as the number of seismic sensors grows, it is becoming increasingly difficult for analysts to pick seismic phases manually and comprehensively, yet such efforts are fundamental to earthquake monitoring. despite years of improvements in automatic phase picking, it is difficult to match the performance of experienced analysts. a more subtle issue is that different seismic analysts may pick phases differently, which can introduce bias into earthquake locations. we present a deep-neural-network-based arrival-time picking method called \"phasenet\" that picks the arrival times of both p and s waves. deep neural networks have recently made rapid progress in feature learning, and with sufficient training, have achieved super-human performance in many applications. phasenet uses three-component seismic waveforms as input and generates probability distributions of p arrivals, s arrivals, and noise as output. we engineer phasenet such that peaks in probability provide accurate arrival times for both p and s waves, and have the potential to increase the number of s-wave observations dramatically over what is currently available. this will enable both improved locations and improved shear wave velocity models. phasenet is trained on the prodigious available data set provided by analyst-labeled p and s arrival times from the northern california earthquake data center. the dataset we use contains more than seven million waveform samples extracted from over thirty years of earthquake recordings. we demonstrate that phasenet achieves much higher picking accuracy and recall rate than existing methods.", "categories": "physics.geo-ph stat.ap", "doi": "10.1093/gji/ggy423", "created": "2018-03-08", "updated": "", "authors": ["weiqiang zhu", "gregory c. beroza"], "affiliation": [], "url": "https://arxiv.org/abs/1803.03211"}
{"title": "stochastic learning under random reshuffling with constant step-sizes", "id": "1803.07964", "abstract": "in empirical risk optimization, it has been observed that stochastic gradient implementations that rely on random reshuffling of the data achieve better performance than implementations that rely on sampling the data uniformly. recent works have pursued justifications for this behavior by examining the convergence rate of the learning process under diminishing step-sizes. this work focuses on the constant step-size case and strongly convex loss function. in this case, convergence is guaranteed to a small neighborhood of the optimizer albeit at a linear rate. the analysis establishes analytically that random reshuffling outperforms uniform sampling by showing explicitly that iterates approach a smaller neighborhood of size $o(\\mu^2)$ around the minimizer rather than $o(\\mu)$. furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement. we also explain the periodic behavior that is observed in random reshuffling implementations.", "categories": "cs.lg math.oc stat.ml", "doi": "10.1109/tsp.2018.2878551", "created": "2018-03-21", "updated": "2018-10-09", "authors": ["bicheng ying", "kun yuan", "stefan vlaski", "ali h. sayed"], "affiliation": [], "url": "https://arxiv.org/abs/1803.07964"}
{"title": "seglearn: a python package for learning sequences and time series", "id": "1803.08118", "abstract": "seglearn is an open-source python package for machine learning time series or sequences using a sliding window segmentation approach. the implementation provides a flexible pipeline for tackling classification, regression, and forecasting problems with multivariate sequence and contextual data. this package is compatible with scikit-learn and is listed under scikit-learn related projects. the package depends on numpy, scipy, and scikit-learn. seglearn is distributed under the bsd 3-clause license. documentation includes a detailed api description, user guide, and examples. unit tests provide a high degree of code coverage.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-03-21", "updated": "2018-10-18", "authors": ["david m. burns", "cari m. whyne"], "affiliation": [], "url": "https://arxiv.org/abs/1803.08118"}
{"title": "structured output learning with abstention: application to accurate opinion prediction", "id": "1803.08355", "abstract": "motivated by supervised opinion analysis, we propose a novel framework devoted to structured output learning with abstention (sola). the structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. for that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abstention and the other, to structured output prediction. to compare fully labeled training data with predictions potentially containing abstentions, we define a wide class of asymmetric abstention-aware losses. learning is achieved by surrogate regression in an appropriate feature space while prediction with abstention is performed by solving a new pre-image problem. thus, sola extends recent ideas about structured output prediction via surrogate problems and calibration theory and enjoys statistical guarantees on the resulting excess risk. instantiated on a hierarchical abstention-aware loss, sola is shown to be relevant for fine-grained opinion mining and gives state-of-the-art results on this task. moreover, the abstention-aware representations can be used to competitively predict user-review ratings based on a sentence-level opinion predictor.", "categories": "cs.lg cs.ai stat.ml", "doi": "", "created": "2018-03-22", "updated": "2018-06-08", "authors": ["alexandre garcia", "slim essid", "chloé clavel", "florence d'alché-buc"], "affiliation": [], "url": "https://arxiv.org/abs/1803.08355"}
{"title": "calibrated prediction intervals for neural network regressors", "id": "1803.09546", "abstract": "ongoing developments in neural network models are continually advancing the state of the art in terms of system accuracy. however, the predicted labels should not be regarded as the only core output; also important is a well-calibrated estimate of the prediction uncertainty. such estimates and their calibration are critical in many practical applications. despite their obvious aforementioned advantage in relation to accuracy, contemporary neural networks can, generally, be regarded as poorly calibrated and as such do not produce reliable output probability estimates. further, while post-processing calibration solutions can be found in the relevant literature, these tend to be for systems performing classification. in this regard, we herein present two novel methods for acquiring calibrated predictions intervals for neural network regressors: empirical calibration and temperature scaling. in experiments using different regression tasks from the audio and computer vision domains, we find that both our proposed methods are indeed capable of producing calibrated prediction intervals for neural network regressors with any desired confidence level, a finding that is consistent across all datasets and neural network architectures we experimented with. in addition, we derive an additional practical recommendation for producing more accurate calibrated prediction intervals. we release the source code implementing our proposed methods for computing calibrated predicted intervals. the code for computing calibrated predicted intervals is publicly available.", "categories": "stat.ml cs.lg", "doi": "", "created": "2018-03-26", "updated": "2019-01-07", "authors": ["gil keren", "nicholas cummins", "björn schuller"], "affiliation": [], "url": "https://arxiv.org/abs/1803.09546"}
{"title": "revisiting skip-gram negative sampling model with rectification", "id": "1804.00306", "abstract": "we revisit skip-gram negative sampling (sgns), one of the most popular neural-network based approaches to learning distributed word representation. we first point out the ambiguity issue undermining the sgns model, in the sense that the word vectors can be entirely distorted without changing the objective value. to resolve the issue, we investigate the intrinsic structures in solution that a good word embedding model should deliver. motivated by this, we rectify the sgns model with quadratic regularization, and show that this simple modification suffices to structure the solution in the desired manner. a theoretical justification is presented, which provides novel insights into quadratic regularization . preliminary experiments are also conducted on google's analytical reasoning task to support the modified sgns model.", "categories": "cs.cl cs.lg stat.ml", "doi": "", "created": "2018-04-01", "updated": "2019-01-14", "authors": ["cun mu", "guang yang", "zheng yan"], "affiliation": [], "url": "https://arxiv.org/abs/1804.00306"}
{"title": "recall traces: backtracking models for efficient reinforcement learning", "id": "1804.00379", "abstract": "in many environments only a tiny subset of all states yield high reward. in these cases, few of the interactions with the environment provide a relevant learning signal. hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. to this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. we can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and sample for which the (state, action)-tuples may have led to that high value state. these traces of (state, action) pairs, which we refer to as recall traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. we provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. our method improves the sample efficiency of both on- and off-policy rl algorithms across several environments and tasks.", "categories": "cs.lg stat.ml", "doi": "", "created": "2018-04-01", "updated": "2019-01-28", "authors": ["anirudh goyal", "philemon brakel", "william fedus", "soumye singhal", "timothy lillicrap", "sergey levine", "hugo larochelle", "yoshua bengio"], "affiliation": [], "url": "https://arxiv.org/abs/1804.00379"}