Overview of the application of the kernel mean embedding in [ ]. Together with reduced set (RS) techniques that limit the complexity of the RKHS expansion, the kernel mean embedding is used to approximate the embedding of a function of random variables, $Z=f(X,Y)$.

A Hilbert space embedding of distributions (KME)---which generalizes the feature map of individual points to probability measures---has emerged as a powerful machinery for probabilistic modeling, machine learning, and causal discovery. The idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) endowed with a kernel $k$. It enables us to apply RKHS methods to probability measures and has given rise to a great deal of research and novel applications of kernel methods.

Given an i.i.d.\ sample $x_1,x_2,\ldots,x_n$ from $\mathbb{P}$, the most natural estimate of the embedding $\mu_{\mathbb{P}}=\mathbb{E}_{\mathbb{P}}[k(X,\cdot)]$ is the empirical average $\hat{\mu}_{\mathbb{P}}=(1/n)\sum_{i=1}^n k(x_i,\cdot)$. In [ ], we showed that this estimator is not optimal in a certain sense. Inspired by the James-Stein estimator, we proposed the so-called kernel mean shrinkage estimators (KMSEs), which improve upon the standard estimator. A suitable explanation for the improvement is a bias-variance tradeoff: the shrinkage estimator reduces variance substantially at the expense of a small bias. In addition, we presented a class of estimators called spectral shrinkage estimators in [ ], which also incorporate the RKHS structure via the eigenspectrum of the empirical covariance operator. Our empirical studies suggest that the proposed estimators are very useful for ``large $p$, small $n$'' situations (e.g., medical data, gene expression analysis, and text documents).
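As a rough illustration of the two estimators, the sketch below evaluates the empirical embedding $\hat{\mu}_{\mathbb{P}}$ at query points and then shrinks it toward the zero function. The Gaussian kernel, the bandwidth, the function names, and the hand-picked shrinkage weight `alpha` are all assumptions made for this example; the KMSEs referred to above choose the amount of shrinkage in a data-dependent way that is not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def empirical_mean_embedding(sample, query, bandwidth=1.0):
    """Evaluate mu_hat(t) = (1/n) * sum_i k(x_i, t) at each query point t."""
    return gaussian_kernel(sample, query, bandwidth).mean(axis=0)

def shrunk_mean_embedding(sample, query, alpha, bandwidth=1.0):
    """Shrink the empirical embedding toward zero: (1 - alpha) * mu_hat.

    alpha is fixed by hand here (an assumption of this sketch).
    """
    return (1.0 - alpha) * empirical_mean_embedding(sample, query, bandwidth)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))   # i.i.d. sample from P (standard normal, for illustration)
t = np.zeros((1, 2))           # a single query point
mu_hat_at_t = empirical_mean_embedding(x, t)
mu_shrunk_at_t = shrunk_mean_embedding(x, t, alpha=0.1)
```

Shrinking toward zero introduces a small bias but lowers the variance of the estimate, which is the tradeoff described above.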

A natural application of the KME is in testing for similarities between samples from distributions. We refer to the distance between two distribution embeddings as the maximum mean discrepancy (MMD). We have formulated a two-sample test [ ] (of whether two distributions are the same), and showed that the independence test (of whether two random variables observed together are statistically independent) is a special case. A further application of the MMD as an independence criterion is in feature selection, where we maximize dependence between features and labels [ ]. We have further developed alternative independence tests based on space partitioning approaches and classical divergence measures (such as the $\ell_1$ distance and KL-divergence) [ ]. Lastly, we also constructed a test for non-i.i.d. data such as time series in [ ].
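A minimal sketch of an MMD-based two-sample test follows. It uses the standard unbiased estimate of the squared MMD and calibrates it with a permutation test, which is a generic choice made here for simplicity and not necessarily the calibration used in the cited work. The Gaussian kernel, bandwidth, sample sizes, and function names are assumptions of the example.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between the distributions of x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Diagonal terms are excluded from the within-sample averages.
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

def permutation_p_value(x, y, num_permutations=200, bandwidth=1.0, seed=0):
    """p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(x, y, bandwidth)
    pooled = np.vstack([x, y])
    m = len(x)
    count = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(pooled))
        stat = mmd2_unbiased(pooled[perm[:m]], pooled[perm[m:]], bandwidth)
        count += int(stat >= observed)
    return (count + 1) / (num_permutations + 1)

rng = np.random.default_rng(1)
same = permutation_p_value(rng.normal(size=(100, 1)), rng.normal(size=(100, 1)))
different = permutation_p_value(
    rng.normal(size=(100, 1)), rng.normal(loc=2.0, size=(100, 1))
)
```

A small p-value, as expected for the shifted pair, is evidence against the hypothesis that the two samples share a distribution.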

Given that the MMD depends on the particular kernel that is chosen, we proposed two kernel selection strategies [ ], the earlier one relying on a classification interpretation of the MMD, and the later one explicitly minimizing the probability of Type II error of the associated two-sample test (that is, the probability of wrongly accepting that two unlike distributions are the same, given samples from each).

We have also used the KME to develop a variant of an SVM which operates on distributions rather than points [ ], permitting modeling of input uncertainties. One can prove a generalized representer theorem for this case, and in the special case of Gaussian input uncertainties and Gaussian kernel SVMs, it leads to a multi-scale SVM, akin to an RBF network with variable widths, which is still trained by solving a quadratic optimization problem. In [ ], we applied this framework to perform bivariate causal inference between $X$ and $Y$ as a classification problem on the joint distribution $\mathbb{P}(X,Y)$. Another interesting application is in domain adaptation [ ]. This idea has also been extended to develop a variant of the one-class SVM that operates on distributions, leading to applications in group anomaly detection [ ].
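One simple way to let a kernel machine operate on distributions is to use the inner product of their embeddings, $\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle$, as a kernel between distributions; with samples in place of the distributions, this is the average of $k(x_i, y_j)$ over all pairs. The sketch below builds such a Gram matrix for a few sample sets. The Gaussian base kernel, the toy data, and the function names are assumptions for illustration, and the resulting matrix could be passed to any solver that accepts a precomputed kernel.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def embedding_inner_product(sample_p, sample_q, bandwidth=1.0):
    """Empirical <mu_P, mu_Q>: the mean of k(x_i, y_j) over all pairs."""
    return gaussian_kernel(sample_p, sample_q, bandwidth).mean()

def distribution_gram_matrix(samples, bandwidth=1.0):
    """Gram matrix whose (i, j) entry compares sample set i with sample set j."""
    n = len(samples)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            value = embedding_inner_product(samples[i], samples[j], bandwidth)
            gram[i, j] = gram[j, i] = value
    return gram

rng = np.random.default_rng(0)
# Three sample sets: two drawn around 0 and one drawn around 3 (toy data).
bags = [rng.normal(loc=c, size=(30, 2)) for c in (0.0, 0.0, 3.0)]
gram = distribution_gram_matrix(bags)
```

The two sets drawn around the same location receive a larger kernel value than sets drawn around different locations.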

A recent application uses kernel means in visualization. When using a power-of-cosine kernel for distributions on the projective sphere, the kernel mean can be represented as a symmetric tensor. In the context of diffusion MRI, this permits an efficient visual and quantitative analysis of the uncertainty in nerve fiber estimates, which can inform the choice of MR acquisition schemes and mathematical models [ ].

A natural question to consider is whether the MMD constitutes a metric on distributions, and is zero if and only if the distributions are the same. When this holds, the RKHS is said to be characteristic. We have determined necessary and sufficient conditions on translation invariant kernels for injectivity, for distributions on compact and non-compact subsets of $\mathbb{R}^d$ [ ]: specifically, the Fourier transform of the kernel should be supported on all of $\mathbb{R}^d$. Gaussian, Laplace, and B-spline kernels satisfy this requirement. The MMD is a member of a larger class of metrics on distributions, known as the integral probability metrics (IPMs). In [16, 4], we provide estimates of IPMs on $\mathbb{R}^d$ which are taken over function classes that are not RKHSs, namely the Wasserstein distance (functions in the unit Lipschitz semi-norm ball) and the Dudley metric (functions in the unit bounded Lipschitz norm ball), and establish strong consistency of our estimators. Comparing the MMD and these two distances, the MMD converges fastest, and at a rate independent of the dimensionality $d$ of the random variables -- by contrast, rates for the classical Wasserstein and Dudley metrics worsen when $d$ grows.
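For reference, the relation between the MMD and the wider IPM family can be written out explicitly. The notation is assumed here for illustration: $\gamma_{\mathcal{F}}$ denotes the IPM over a function class $\mathcal{F}$, $\mathcal{H}$ the RKHS of $k$, and $\mathbb{Q}$ a second distribution, with $X, X' \sim \mathbb{P}$ and $Y, Y' \sim \mathbb{Q}$ independent.

$$
\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q}) = \sup_{f\in\mathcal{F}} \left| \mathbb{E}_{\mathbb{P}}[f(X)] - \mathbb{E}_{\mathbb{Q}}[f(Y)] \right|
$$

$$
\mathrm{MMD}(\mathbb{P},\mathbb{Q}) = \gamma_{\{f \,:\, \|f\|_{\mathcal{H}}\le 1\}}(\mathbb{P},\mathbb{Q}) = \left\| \mu_{\mathbb{P}} - \mu_{\mathbb{Q}} \right\|_{\mathcal{H}}
$$

$$
\mathrm{MMD}^2(\mathbb{P},\mathbb{Q}) = \mathbb{E}[k(X,X')] + \mathbb{E}[k(Y,Y')] - 2\,\mathbb{E}[k(X,Y)]
$$

Replacing the unit RKHS ball by the unit Lipschitz semi-norm ball or the unit bounded Lipschitz norm ball in the first display gives the Wasserstein distance and the Dudley metric, respectively. The last display involves only expectations of kernel evaluations, so the MMD can be estimated by replacing them with sample averages.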

Embeddings of distributions can be generalized to yield embeddings of conditional distributions. The first application is to Bayesian inference on graphical models. We have developed two approaches: in the first [ ], the messages are conditional density functions, subject to smoothness constraints; these were orders of magnitude faster than competing nonparametric BP approaches, yet more accurate, on problems including depth reconstruction from 2-D images and robot orientation recovery. In the second approach [ ], conditional distributions $P(Y|X=x)$ are represented directly as embeddings in the RKHS, allowing greater generality (for instance, one can define distributions over structured objects such as strings or graphs, for which probability densities may not exist). We showed the conditional mean embedding to be a solution to a vector valued regression problem [ ], which allows us to formulate sparse estimates. The second application is to reinforcement learning. In [ ], we estimate the optimal value function for a Markov decision process using conditional distribution embeddings, and the associated policy. This work was generalized to partially observable Markov decision processes in [ ], where the kernel Bayes' rule was used to integrate over distributions of the hidden states.
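The regression view of the conditional mean embedding can be sketched as follows. Under a standard ridge-regularized estimate, the embedding of $P(Y|X=x)$ is a weighted sum of feature maps of the observed $y_i$, with weights $\beta(x) = (K + n\lambda I)^{-1} k_x$, where $K$ is the kernel matrix on the inputs and $k_x$ collects the kernel values between the training inputs and $x$. The Gaussian kernel, the regularization value, the toy data, and the function names below are assumptions made for illustration; the sparse estimates mentioned above are not covered.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def conditional_embedding_weights(x_train, x_query, reg=1e-3, bandwidth=1.0):
    """Weights beta(x) = (K + n * reg * I)^{-1} k_x, one column per query point."""
    n = len(x_train)
    k_xx = gaussian_kernel(x_train, x_train, bandwidth)
    k_xq = gaussian_kernel(x_train, x_query, bandwidth)
    return np.linalg.solve(k_xx + n * reg * np.eye(n), k_xq)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(x) + 0.1 * rng.normal(size=(200, 1))   # toy data: Y = sin(X) + noise

beta = conditional_embedding_weights(x, np.array([[1.0]]))
# Conditional expectations are approximated by weighted sums over the y_i:
# E[g(Y) | X = x] is estimated by sum_i beta_i(x) * g(y_i); here g is the identity.
cond_mean = float(beta[:, 0] @ y[:, 0])
```

On this toy data `cond_mean` should lie close to $\sin(1)$, the true conditional mean at $x = 1$.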

Another important application of conditional mean embeddings is in testing for conditional independence (CI). We proposed a Kernel-based Conditional Independence test (KCI-test) [ ] which avoids the classical drawbacks of CI testing. Most importantly, we further derived its asymptotic distribution under the null hypothesis, and provided ways to estimate such a distribution. Our method is computationally appealing and is less sensitive to the dimensionality of $Z$ compared to other methods. This is the first time that the null distribution of the kernel-based statistic for CI testing has been derived. Recently, we proposed a new permutation-based CI test [ ] that easily allows the incorporation of prior knowledge during the permutation step, has power competitive with state-of-the-art kernel CI tests, and accurately estimates the null distribution of the test statistic, even as the dimensionality of the conditioning variable grows.

Lastly, we have recently leveraged the KME in computing functions of random variables $Z=f(X_1,X_2,\ldots,X_n)$ [ ], a task that is ubiquitous in applications such as probabilistic programming. Our approach allows us to obtain the distribution embedding of $Z$ directly from the embeddings of $X_1,X_2,\ldots,X_n$ without resorting to density estimation. It is in principle applicable to all functional operations and data types, thanks to the generality of kernel methods. Based on the proposed framework, we showed how it can be applied to non-parametric structural equation models, with an application to causal inference. As an aside, we have also developed algorithms based on distribution embeddings for identifying confounders [ ], which is one of the most fundamental problems in causal inference.
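The idea can be sketched for two variables. If the embeddings of $X$ and $Y$ are given as weighted expansions over sample points, and $X$ and $Y$ are assumed independent (an assumption of this sketch), an embedding of $Z=f(X,Y)$ can be estimated by applying $f$ to every pair of expansion points and multiplying the corresponding weights. The scalar variables, the Gaussian kernel, and the function names are illustrative choices.

```python
import numpy as np

def embed_function_of_variables(f, x_points, x_weights, y_points, y_weights):
    """Expansion points and weights for Z = f(X, Y), with X and Y independent."""
    z_points = np.array([f(xi, yj) for xi in x_points for yj in y_points])
    # Row-major flattening matches the pair order of the comprehension above.
    z_weights = np.outer(x_weights, y_weights).ravel()
    return z_points, z_weights

def evaluate_embedding(points, weights, query, bandwidth=1.0):
    """mu(t) = sum_i w_i * k(z_i, t) for scalar points and a Gaussian kernel."""
    diffs = points[:, None] - query[None, :]
    return weights @ np.exp(-diffs**2 / (2.0 * bandwidth**2))

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = rng.normal(size=40)
w = np.full(40, 1.0 / 40)   # uniform weights, as in the empirical embedding

z, wz = embed_function_of_variables(lambda a, b: a + b, x, w, y, w)
mu_z_at_0 = evaluate_embedding(z, wz, np.array([0.0]))
```

The expansion for $Z$ has one term per pair of input points, so its size grows multiplicatively with the number of variables; this growth is what reduced set techniques are meant to limit.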

**Kernel Mean Shrinkage Estimators**
*Journal of Machine Learning Research*, 17(48):1-41, 2016 (article)

**Towards a Learning Theory of Cause-Effect Inference**
In *Proceedings of the 32nd International Conference on Machine Learning*, 37, pages: 1452–1461, JMLR Workshop and Conference Proceedings, (Editors: F. Bach and D. Blei), JMLR, ICML, 2015 (inproceedings)

**The Randomized Causation Coefficient**
*Journal of Machine Learning*, 2015 (article) To be published

**Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations**
*Statistics and Computing*, 25(4):755-766, 2015 (article)

**A Permutation-Based Kernel Conditional Independence Test**
In *Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI2014)*, pages: 132-141, (Editors: Nevin L. Zhang and Jin Tian), AUAI Press Corvallis, Oregon, UAI2014, 2014 (inproceedings)

**Kernel Mean Estimation via Spectral Filtering**
In *Advances in Neural Information Processing Systems 27*, pages: 1-9, (Editors: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger), Curran Associates, Inc., 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014 (inproceedings)

**Single-Source Domain Adaptation with Target and Conditional Shift**
In *Regularization, Optimization, Kernels, and Support Vector Machines*, pages: 427-456, 19, Chapman & Hall/CRC Machine Learning & Pattern Recognition, (Editors: Suykens, J. A. K., Signoretto, M. and Argyriou, A.), Chapman and Hall/CRC, Boca Raton, USA, 2014 (inbook)

**Kernel Mean Estimation and Stein Effect**
In *Proceedings of the 31st International Conference on Machine Learning, W&CP 32 (1)*, pages: 10-18, (Editors: Eric P. Xing and Tony Jebara), JMLR, ICML, 2014 (inproceedings)

**Visualizing Uncertainty in HARDI Tractography Using Superquadric Streamtubes**
In *Eurographics Conference on Visualization, Short Papers*, (Editors: Elmqvist, N. and Hlawitschka, M. and Kennedy, J.), EuroVis, 2014 (inproceedings)

**Causal discovery via reproducing kernel Hilbert space embeddings**
*Neural Computation*, 26(7):1484-1517, 2014 (article)

**Domain adaptation under Target and Conditional Shift**
In *Proceedings of the 30th International Conference on Machine Learning, W&CP 28 (3)*, pages: 819–827, (Editors: S Dasgupta and D McAllester), JMLR, ICML, 2013 (inproceedings)

**One-class Support Measure Machines for Group Anomaly Detection**
In *Proceedings 29th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages: 449-458, (Editors: Ann Nicholson and Padhraic Smyth), AUAI Press, Corvallis, Oregon, UAI, 2013 (inproceedings)

**Domain Generalization via Invariant Feature Representation**
In *Proceedings of the 30th International Conference on Machine Learning, W&CP 28(1)*, pages: 10-18, (Editors: S Dasgupta and D McAllester), JMLR, ICML, 2013, Volume 28, number 1 (inproceedings)

**HiFiVE: A Hilbert Space Embedding of Fiber Variability Estimates for Uncertainty Modeling and Visualization**
*Computer Graphics Forum*, 32(3):121-130, (Editors: B Preim, P Rheingans, and H Theisel), Blackwell Publishing, Oxford, UK, Eurographics Conference on Visualization (EuroVis), 2013 (article)

**Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators**
In *Advances in Neural Information Processing Systems 26*, pages: 2535-2543, (Editors: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger), 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013 (inproceedings)

**Identifying Finite Mixtures of Nonparametric Product Distributions and Causal Inference of Confounders**
In *Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages: 556-565, (Editors: A Nicholson and P Smyth), AUAI Press Corvallis, Oregon, USA, UAI, 2013 (inproceedings)

**A Kernel Two-Sample Test**
*Journal of Machine Learning Research*, 13, pages: 723-773, March 2012 (article)

**Feature Selection via Dependence Maximization**
*Journal of Machine Learning Research*, 13, pages: 1393-1434, May 2012 (article)

**On the Empirical Estimation of Integral Probability Metrics**
*Electronic Journal of Statistics*, 6, pages: 1550-1599, 2012 (article)

**Optimal kernel choice for large-scale two-sample tests**
In *Advances in Neural Information Processing Systems 25*, pages: 1214-1222, (Editors: P Bartlett and FCN Pereira and CJC. Burges and L Bottou and KQ Weinberger), Curran Associates Inc., 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012 (inproceedings)

**Conditional mean embeddings as regressors**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 1823-1830, (Editors: J Langford and J Pineau), Omnipress, New York, NY, USA, ICML, 2012 (inproceedings)

**Modelling transition dynamics in MDPs with RKHS embeddings**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 535-542, (Editors: J Langford and J Pineau), Omnipress, New York, NY, USA, ICML, 2012 (inproceedings)

**Learning from distributions via support measure machines**
In *Advances in Neural Information Processing Systems 25*, pages: 10-18, (Editors: P Bartlett, FCN Pereira, CJC. Burges, L Bottou, and KQ Weinberger), Curran Associates Inc., 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012 (inproceedings)

**Hilbert Space Embeddings of POMDPs**
In Conference on Uncertainty in Artificial Intelligence (UAI), 2012 (inproceedings)

**Hilbert space embedding for Dirichlet Process mixtures**
In NIPS Workshop on confluence between kernel methods and graphical models, 2012 (inproceedings) To be published

**Kernel Bayes’ Rule**
In *Advances in Neural Information Processing Systems 24*, pages: 1737-1745, (Editors: J Shawe-Taylor and RS Zemel and P Bartlett and F Pereira and KQ Weinberger), Curran Associates, Inc., Red Hook, NY, USA, Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS), 2011 (inproceedings)

**Kernel Belief Propagation**
In *Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Vol. 15*, pages: 707-715, (Editors: G Gordon and D Dunson and M Dudík), JMLR, AISTATS, 2011 (inproceedings)

**Kernel-based Conditional Independence Test and Application in Causal Discovery**
In pages: 804-813, (Editors: FG Cozman and A Pfeffer), AUAI Press, Corvallis, OR, USA, 27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011 (inproceedings)

**Consistent Nonparametric Tests of Independence**
*Journal of Machine Learning Research*, 11, pages: 1391-1423, 2010 (article)

**Hilbert Space Embeddings and Metrics on Probability Measures**
*Journal of Machine Learning Research*, 11, pages: 1517-1561, April 2010 (article)

**Nonparametric Tree Graphical Models**
In *Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Volume 9*, pages: 765-772, (Editors: YW Teh and M Titterington), JMLR, AISTATS, 2010 (inproceedings)

**Non-parametric estimation of integral probability metrics**
In *Proceedings of the IEEE International Symposium on Information Theory (ISIT 2010)*, pages: 1428-1432, IEEE, Piscataway, NJ, USA, IEEE International Symposium on Information Theory (ISIT), June 2010 (inproceedings)

**Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions**
In *Advances in Neural Information Processing Systems 22*, pages: 1750-1758, (Editors: Y Bengio and D Schuurmans and J Lafferty and C Williams and A Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

**Nonparametric Independence Tests: Space Partitioning and Kernel Approaches**
19th International Conference on Algorithmic Learning Theory (ALT08), October 2008 (talk)