Philipp Hennig (Group Leader),
Simon Bartels,
Hans Kersting,
Maren Mahsereci,
Michael Schober,
Edgar Klenske

- Philipp was recently interviewed for
*Talking Machines* - Maren Mahsereci's work on probabilistic line-searches is selected for a full oral at NIPS 2015
- We are co-organizing a NIPS workshop on Probabilistic Integration at NIPS 2015
- Michael Schober's piece on the probabilistic interpretation of Runge-Kutta methods is selected for a full oral at NIPS 2014.

Artificial intelligent systems build models of their environment from observations, and choose actions that they predict will have beneficial effect on the environment's state. A central challenge is that the mathematical models used in this process call for computations that have no closed, analytic solution. Learning machines thus rely on a whole toolbox of *numerical methods*:* *High-dimensional i*ntegration *routines* *are used for marginalization and conditioning in probabilistic models. Fitting of parameters poses nonlinear (and often non-convex) *optimization* problems. And the prediction of dynamic changes in the environment involves various forms of *differential equations*. In addition, there are special cases for each of these areas in which the computation amounts to large-scale *linear algebra* (i.e. Gaussian conditioning, least-squares optimization, linear differential equations). Traditionally, machine learning researchers have served these needs by taking numerical methods "off the shelf" and treating them as black boxes.

Since the 1970s, researchers like Wahba, Diaconis and O'Hagan repeatedly pointed out that, in fact, numerical methods can themselves be interpreted as statistical estimation methods -- as acting, learning machines: Numerical methods *estimate* an unknown, intractable, latent quantity *given* computable, tractable, observable quantities. For example, an integration method estimates the value of an integral given access to evaluations of the integrand. This is a rather abstract observation, but Diaconis and O'Hagan separately made a precise connection between inference and compuation in the case of integration: Several classic quadrature rules, e.g. the trapezoidal rule, can be interpreted as the maximum a posteriori (MAP) estimator arising from a certain family of Gaussian process priors on the integrand.

Over recent years, the research group on probabilistic numerics at the MPI IS (since 2015 supported by an Emmy Noether grant of the DFG) has been able to add more such bridges between computation and inference, across the domains of numerical computation, by showing that various basic numerical methods are MAP estimates under equally basic probabilistic priors. Hennig and Kiefel [ ] showed that quasi-Newton methods, such as the BFGS rule, arise as the mean of a Gaussian distribution over the elements of the inverse Hessian matrix of an optimization objective. Hennig extended this result to linear solvers [ ], in particular the method of conjugate gradients for linear problems (which amounts to Gaussian regression on the elements of the inverse of a symmetric matrix. And regarding ordinary differential equations, Schober, Duvenaud and Hennig [ ] showed that various Runge-Kutta methods can be interpreted as autoregressive filters, returning a Gaussian process posterior over the solution of a differential equation.

The picture emerging from these connections is a mathematically precise description of computation itself as the active collection of information. In this view, the analytic description of a numerical task provides a prior probability measure over possible solutions; which can be shrunk towards the limit of a point measure through conditioning on the result of tractable computations. Many concepts and philosophical problems from statistics carry over to computation quite naturally, with two notable differences: First, in numerical ``inference'' tasks, the validity of the prior can be analysed to a higher formal degree than in inference from physical data sources, because the task is specified in a formal (programming) language. Secondly, since numerical routines are the bottom, ``mechanistic'' layer of artificial intelligence, the ``inner loop'', they are subject to strict requirements on computational complexity. Internally, a numerical method can only use tractable floating-point operations. This translates into a constraint on acceptable probabilistic models -- most basic numerical methods make Gaussian assumptions.

In the context of machine learning, this description of numerical computation as the collection of information opens a number of exciting directions:

- Once it is clear that a numerical method uses an implicit prior, it is natural to adapt this prior to reflect available knowledge about the integrand. This design of "customized numerics" was used in a collaboration with colleagues at Oxford to build an efficient active integration method that outperforms Monte Carlo integration methods in wall-clock time on integration problems of moderate dimensionality [ ].
- Many numerical problems are defined relative to a setup that is itself uncertain to begin with. Once numerical methods are defined as probabilistic inference, such uncertainty can often be captured quite naturally. In a collaboration with colleagues in Copenhagen, Hauberg, Schober, Hennig and others [ ] showed how uncertainty arising from a medical imaging process can be propagated in an approximate inference fashion to more completely model uncertainty over neural pathways in the human brain.
- Explicit representations of uncertainty can also be used to increase robustness of a numerical computation itself. Addressing a pertinent issue in deep learning, Mahsereci and Hennig [ ] constructed a line search method---an elementary building block of nonlinear optimization methods---that is able to work with gradient evaluations corrupted by noise. This method drastically simplifies the problem of choosing a step size for the stochastic gradient descent methods used to train deep neural networks.
- More generally, it is possible to define
*probabilistic numerical methods*, which accept probability measures over a numerical problem as inputs, and return another probability measure over the solution of the problem, which reflects both the effect of the input uncertainty, and uncertainty arising from the finite precision of the internal computation itself. A position paper by Hennig, Osborne and Girolami [ ] motivates this class of methods and suggests their use for the control of computational effort across heterogeneous, composite chains of computations, such as those that make up intelligent machines. A DFG-funded collaboration between Mark Bangert's group at the German Cancer Research Centre and the Emmy Noether group develops an approximate inference framework to propagate physical uncertainties through the planning (optimization) pipeline for radiation treatment, to lower the risk of unnecessary complications for cancer patients [ ].

- More generally, it is possible to define

In a separate but related development, a community has also arisen around the formulation of *global* optimization as inference, and the formulation of sample-efficient optimization methods. These *Bayesian Optimization* methods can, for example, be used to structurally optimize, automate the design of machine learning models themselves. We contributed to this area with the development of the *Entropy Search* [ ] algorithm that automatically performs experiments expected to provide maximal information about the location of a function's extremum.

The Emmy Noether group on Probabilistic Numerics keeps close collaborative contact with the groups of Michael A Osborne in Oxford and Mark Girolami in Warwick. Together with the colleagues in Oxford, Philipp Hennig hosts the community portal probabilistic-numerics.org. Members of the group have co-organized a growing number of international meetings on probabilistic numerics, including, among others, several workshops at NIPS (including the very first workshop on Probabilistic Numerics), in Tübingen, at the DALI conference, and at the University of Warwick.

Probability Theory is an integral part of machine learning as a discipline, and continues to play a major role in the development of the ﬁeld. The modularity of probabilistic models has repeatedly allowed for them to be extended and developed into general frameworks

for large classes of problems, like regression and density estimation. In many areas of statistics and machine learning, probabilistic formulations were not the ﬁrst to the table, but still helped improve understanding of implicit assumptions of the established statistical algorithms, helped with theoretical analysis, and lit the way to algorithmic extensions. Over the years, members of the department of empirical inference have contributed substantially to the ﬁeld of probabilistic learning.

23 results
(BibTeX)

**Probabilistic Line Searches for Stochastic Optimization**
In *Advances in Neural Information Processing Systems 28*, pages: 181-189, (Editors: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama and R. Garnett), Curran Associates, Inc., 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015 (inproceedings)

**Probabilistic numerics and uncertainty in computations**
*Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences*, 471(2179), 2015 (article)

**A Random Riemannian Metric for Probabilistic Shortest-Path Tractography**
In *18th International Conference on Medical Image Computing and Computer Assisted Intervention*, 9349, pages: 597-604, Lecture Notes in Computer Science, MICCAI, 2015 (inproceedings)

**Probabilistic Interpretation of Linear Solvers**
*SIAM Journal on Optimization*, 25(1):234-260, 2015 (article)

**Probabilistic Progress Bars**
In *Conference on Pattern Recognition (GCPR)*, 8753, pages: 331-341, Lecture Notes in Computer Science, (Editors: Jiang, X., Hornegger, J., and Koch, R.), Springer, GCPR, September 2014 (inproceedings)

**Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics**
In *Proceedings of the 17th International Conference on Artificial Intelligence and Statistics*, 33, pages: 347-355, JMLR: Workshop and Conference Proceedings, (Editors: S Kaski and J Corander), Microtome Publishing, Brookline, MA, AISTATS, April 2014 (inproceedings)

**Probabilistic ODE Solvers with Runge-Kutta Means**
In *Advances in Neural Information Processing Systems 27*, pages: 739-747, (Editors: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger), Curran Associates, Inc., 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014 (inproceedings)

**Active Learning of Linear Embeddings for Gaussian Processes**
In *Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence*, pages: 230-239, (Editors: NL Zhang and J Tian), AUAI Press , Corvallis, Oregon, UAI2014, 2014, another link: http://arxiv.org/abs/1310.6740 (inproceedings)

**Probabilistic Shortest Path Tractography in DTI Using Gaussian Process ODE Solvers**
In *Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, Lecture Notes in Computer Science Vol. 8675*, pages: 265-272, (Editors: P. Golland, N. Hata, C. Barillot, J. Hornegger and R. Howe), Springer, Heidelberg, MICCAI, 2014 (inproceedings)

**Incremental Local Gaussian Regression**
In *Advances in Neural Information Processing Systems 27*, pages: 972-980, (Editors: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger), 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014, clmc (inproceedings)

**Efficient Bayesian Local Model Learning for Control**
In *Proceedings of the IEEE International Conference on Intelligent Robots and Systems*, pages: 2244 - 2249, IROS, 2014, clmc (inproceedings)

**Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature**
In *Advances in Neural Information Processing Systems 27*, pages: 2789-2797, (Editors: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger), Curran Associates, Inc., 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014 (inproceedings)

**Quasi-Newton Methods: A New Direction**
*Journal of Machine Learning Research*, 14(1):843-865, March 2013 (article)

**Fast Probabilistic Optimization from Noisy Gradients**
In *Proceedings of The 30th International Conference on Machine Learning, JMLR W&CP 28(1)*, pages: 62–70, (Editors: S Dasgupta and D McAllester), ICML, 2013 (inproceedings)

**Nonparametric dynamics estimation for time periodic systems**
In *Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing*, pages: 486-493 , 2013 (inproceedings)

**Analytical probabilistic modeling for radiation therapy treatment planning**
*Physics in Medicine and Biology*, 58(16):5401-5419, 2013 (article)

**Analytical probabilistic proton dose calculation and range uncertainties**
In *17th International Conference on the Use of Computers in Radiation Therapy*, pages: 6-11, (Editors: A. Haworth and T. Kron), ICCR, 2013 (inproceedings)

**Animating Samples from Gaussian Distributions**
(8), Max Planck Institute for Intelligent Systems, Tübingen, Germany, 2013 (techreport)

**Quasi-Newton Methods: A New Direction**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 25-32, ICML ’12, (Editors: John Langford and Joelle Pineau), Omnipress, New York, NY, USA, ICML, July 2012 (inproceedings)

**Kernel Topic Models**
In *Fifteenth International Conference on Artificial Intelligence and Statistics*, 22, pages: 511-519, JMLR Proceedings, (Editors: Lawrence, N. D. and Girolami, M.), JMLR.org, AISTATS , 2012 (inproceedings)