My scientific interests are in the field of machine learning and inference from empirical data. In particular, I study kernel methods for extracting regularities from possibly high-dimensional data. These regularities are usually statistical ones; however, in recent years I have also become interested in methods for finding causal structures that underlie statistical dependences. I have worked on a number of different applications of machine learning - in our field, you get "to play in everyone's backyard." Most recently, I have been trying to play in the backyard of astronomers and photographers.
With the growing interest in (how to make money with) big data, machine learning has significantly gained in popularity. In January 2015, we published an article in the German newspaper FAZ discussing some of the implications. Disclaimer: the newspaper added some text that appears above our names; this was not written or approved by us.
In March 2018, I published an article about the cybernetic revolution in the German newspaper SZ. It starts with the thesis that the current revolution is about processing (generating, converting, industrializing) information in much the same way the first two industrial revolutions dealt with processing (generating, converting, industrializing) energy. I have occasionally put forward this thesis (but I'm sure I am not the only one who thinks of it this way), for instance during a NYU symposium on the future of AI in January 2016 (here are some notes written by Max Tegmark). The article also provides recommendations on what Europe should do to keep up with these developments.
My department and/or members of the department (incl. myself) receive funding from a number of sources including Max Planck, the DFG, the Alexander-von-Humboldt foundation, Amazon, Google, Bosch, Facebook, the BMBF (German Ministry of Science), the EU, the ETH Zürich, and the Stanford Center on Philanthropy and Civil Society.
M.Sc. in mathematics and Lionel Cooper Memorial Prize, University of London (1992)
Diplom in physics (Tübingen, 1994)
Doctorate in computer science (Technical University of Berlin, 1997); the thesis on Support Vector Learning (main advisor: V. Vapnik, AT&T Bell Labs) won the annual dissertation prize of the German Association for Computer Science (GI)
If you'd like to contact me, please consider these two notes:
1. I recently became co-editor-in-chief of JMLR. I work for JMLR because I believe in its open access model, but it takes a lot of time. During my JMLR term, please don't try to convince me to take on other journal or grant reviewing duties.
2. I am not very organized with my e-mail, so if you want to apply for a position in my lab, please send your application only to Sekretariat-Schoelkopf@tuebingen.mpg.de. Note that we do not respond to non-personalized applications that look like they are being sent to a large number of places simultaneously.
We are always happy to receive outstanding applications for PhD positions and postdocs.
In Proceedings of the First IEEE International Conference on Computational Photography (ICCP 2009), pages: 1-7, IEEE, Piscataway, NJ, USA, First IEEE International Conference on Computational Photography (ICCP), April 2009 (inproceedings)
Atmospheric turbulence blurs astronomical images taken by earth-based telescopes. Taking many short-time exposures in such a situation provides noisy images of the same object, where each noisy image has a different blur. Commonly, astronomers apply a technique called “Lucky Imaging” that selects a few of the recorded frames that fulfill certain criteria, such as reaching a certain peak intensity (“Strehl ratio”). The selected frames are then averaged to obtain a better image. In this paper we introduce and analyze a new method that exploits all the frames and generates an improved image in an online fashion. Our initial experiments with controlled artificial data and real-world astronomical datasets yield promising results.
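As a toy illustration of the contrast the paper addresses (averaging a few "lucky" frames versus exploiting all of them), here is a minimal NumPy sketch. The point-source scene, the blur model, and the peak-based selection criterion are illustrative stand-ins, not the paper's online algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stack of short exposures of a single point source; each frame gets
# its own random blur and noise (illustrative stand-in for real frames).
n_frames, size = 200, 64
truth = np.zeros((size, size))
truth[32, 32] = 1.0

def blur(img, sigma):
    # Gaussian blur applied in the Fourier domain
    f = np.fft.fftfreq(size)
    g = np.exp(-2 * (np.pi * sigma) ** 2 * (f[:, None] ** 2 + f[None, :] ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * g))

frames = np.stack([
    blur(truth, sigma=rng.uniform(0.5, 3.0)) + 0.01 * rng.standard_normal((size, size))
    for _ in range(n_frames)
])

# "Lucky Imaging": average only the sharpest frames, using the peak value
# as a crude Strehl-ratio proxy ...
peaks = frames.max(axis=(1, 2))
lucky = frames[np.argsort(peaks)[-20:]].mean(axis=0)

# ... versus naively averaging all frames. The paper's method improves on
# both by using every frame in an online multiframe deconvolution.
plain = frames.mean(axis=0)

print("peak of lucky average:", lucky.max())
print("peak of plain average:", plain.max())
```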
Expert Systems with Applications, 36(2):3284-3292, March 2009 (article)
In bioinformatics, there exist multiple descriptions of graphs for the same set of genes or proteins. For instance, in yeast systems, graph edges can represent different relationships such as protein-protein interactions, genetic interactions, or co-participation in a protein complex, etc. Relying on similarities between nodes, each graph can be used independently for prediction of protein function. However, since different graphs contain partly independent and partly complementary information about the problem at hand, one can enhance the total information extracted by combining all graphs. In this paper, we propose a method for integrating multiple graphs within a framework of semi-supervised learning. The method alternates between minimizing the objective function with respect to network output and with respect to combining weights. We apply the method to the task of protein functional class prediction in yeast. The proposed method performs significantly better than the same algorithm trained on any single graph.
European Journal of Nuclear Medicine and Molecular Imaging, 36(Supplement 1):93-104, March 2009 (article)
Introduction Positron emission tomography (PET) is a fully quantitative technology for imaging metabolic pathways and dynamic processes in vivo. Attenuation correction of raw PET data is a prerequisite for quantification and is typically based on separate transmission measurements. In PET/CT, however, attenuation correction is performed routinely based on the available CT transmission data.
Objective Recently, combined PET/magnetic resonance (MR) has been proposed as a viable alternative to PET/CT. Current concepts of PET/MRI do not include CT-like transmission sources and, therefore, alternative methods of PET attenuation correction must be found. This article reviews existing approaches to MR-based attenuation correction (MR-AC). Most groups have proposed MR-AC algorithms for brain PET studies and more recently also for torso PET/MR imaging. Most MR-AC strategies require the use of complementary MR and transmission images, or morphology templates generated from transmission images. We review and discuss these algorithms and point out challenges for using MR-AC in clinical routine.
Discussion MR-AC is work-in-progress with potentially promising results from a template-based approach applicable to both brain and torso imaging. While efforts are ongoing in making clinically viable MR-AC fully automatic, further studies are required to realize the potential benefits of MR-based motion compensation and partial volume correction of the PET data.
Neural Computation, 21(1):272-300, January 2009 (article)
We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it on a line, and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We report a simple yet general approach to express different types of linear classification algorithms in an identical and easy-to-visualize formal framework using generalized prototypes, where these prototypes are used to express the normal vector and offset of the hyperplane. We investigate nonmargin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.
In Dataset Shift in Machine Learning, pages: 131-160, (Editors: Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. and Lawrence, N. D.), MIT Press, Cambridge, MA, USA, 2009 (inbook)
Given sets of observations of training and test data, we consider the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We achieve this goal by matching covariate distributions between training and test sets in a high-dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation. Instead, the sample weights are obtained by a simple quadratic programming procedure. We provide a uniform convergence bound on the distance between the reweighted training feature mean and the test feature mean, a transductive bound on the expected loss of an algorithm trained on the reweighted data, and a connection to single-class SVMs. While our method is designed to deal with the case of simple covariate shift (in the sense of Chapter ??), we have also found benefits for sample selection bias on the labels. Our correction procedure yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is "simpler" than the data might suggest.
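A minimal sketch of this reweighting idea (often called kernel mean matching): the quadratic program below makes the weighted training kernel mean match the test kernel mean, subject to box and normalization constraints. The kernel bandwidth, the bound B, and the tolerance eps are illustrative choices, and a general-purpose SLSQP routine stands in for a dedicated QP solver.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kmm_weights(X_tr, X_te, sigma=1.0, B=10.0, eps=0.1):
    """Reweight training points so their kernel mean matches the test set:
    minimize 0.5 b'Kb - kappa'b over 0 <= b_i <= B, with the weight sum
    constrained near n_tr."""
    n_tr, n_te = len(X_tr), len(X_te)
    K = gaussian_kernel(X_tr, X_tr, sigma)
    kappa = (n_tr / n_te) * gaussian_kernel(X_tr, X_te, sigma).sum(axis=1)
    res = minimize(
        lambda b: 0.5 * b @ K @ b - kappa @ b,
        np.ones(n_tr),
        jac=lambda b: K @ b - kappa,
        bounds=[(0.0, B)] * n_tr,
        constraints=[
            {"type": "ineq", "fun": lambda b: n_tr * (1 + eps) - b.sum()},
            {"type": "ineq", "fun": lambda b: b.sum() - n_tr * (1 - eps)},
        ],
        method="SLSQP",
    )
    return res.x

# Toy usage: training sample shifted relative to the test sample.
X_tr = np.random.default_rng(0).normal(0.0, 1.0, (100, 1))
X_te = np.random.default_rng(1).normal(0.5, 1.0, (80, 1))
weights = kmm_weights(X_tr, X_te)  # larger weights where test density is higher
```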
In Advances in Neural Information Processing Systems 22, pages: 1750-1758, (Editors: Y Bengio and D Schuurmans and J Lafferty and C Williams and A Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)
Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
Journal of Vision, 9(8):article 32, 2009 (article)
For simple visual patterns under the experimenter's control we impose which information, or features, an observer can use to solve a given perceptual task. For natural vision tasks, however, there are typically a multitude of potential features in a given visual scene which the visual system may be exploiting when analyzing it: edges, corners, contours, etc. Here we describe a novel non-linear system identification technique based on modern machine learning methods that allows the critical features an observer uses to be inferred directly from the observer's data. The method neither requires stimuli to be embedded in noise nor is it limited to linear perceptive fields (classification images). We demonstrate our technique by deriving the critical image features observers fixate in natural scenes (bottom-up visual saliency). Unlike previous studies where the relevant structure is determined manually, e.g. by selecting Gabors as visual filters, we do not make any assumptions in this regard, but numerically infer their number and properties from the eye-movement data. We show that center-surround patterns emerge as the optimal solution for predicting saccade targets from local image structure. The resulting model, a one-layer feed-forward network with contrast gain-control, is surprisingly simple compared to previously suggested saliency models. Nevertheless, our model is equally predictive. Furthermore, our findings are consistent with neurophysiological hardware in the superior colliculus. Bottom-up visual saliency may thus not be computed cortically as has been thought previously.
In Kernel Methods for Remote Sensing Data Analysis, pages: 25-48, 2, (Editors: Gustavo Camps-Valls and Lorenzo Bruzzone), Wiley, New York, NY, USA, 2009 (inbook)
Kernel learning algorithms are currently becoming a standard tool in the area of machine learning and pattern recognition. In this chapter we review the fundamental theory of kernel learning. As the basic building block we introduce the kernel function, which provides an elegant and general way to compare possibly very complex objects. We then review the concept of a reproducing kernel Hilbert space and state the representer theorem. Finally, we give an overview of the most prominent algorithms, which are support vector classification and regression, Gaussian processes, and kernel principal component analysis. With multiple kernel learning and structured output prediction we also introduce some more recent advancements in the field.
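As a minimal sketch of the building blocks this chapter reviews: a kernel function compares objects, and the representer theorem guarantees that the regularized RKHS solution is a finite kernel expansion over the training points. The example below uses kernel ridge regression on toy 1D data; kernel choice, bandwidth, and regularization constant are illustrative assumptions.

```python
import numpy as np

def k(x, z, sigma=0.5):
    # Gaussian kernel on the reals: the "basic building block"
    return np.exp(-np.subtract.outer(x, z) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 40))            # toy 1D inputs
y = np.sin(X) + 0.1 * rng.standard_normal(40)  # noisy targets

# Representer theorem: the regularized RKHS minimizer has the finite form
# f(.) = sum_i alpha_i k(x_i, .); for squared loss (kernel ridge regression)
# the coefficients solve a linear system.
lam = 1e-2
alpha = np.linalg.solve(k(X, X) + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 200)
f_test = k(X_test, X) @ alpha   # predictions expand in kernels on training points
```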
Pattern Recognition, 41(11):3271-3286, November 2008 (article)
Many common machine learning methods such as Support Vector Machines or Gaussian process inference make use of positive definite kernels, reproducing kernel Hilbert spaces, Gaussian processes, and regularization operators. In this work these objects are presented in a general, unifying framework, and interrelations are highlighted. With this in mind, we then show how linear stochastic differential equation models can be incorporated naturally into the kernel framework; and vice versa, many kernel machines can be interpreted in terms of differential equations. We focus especially on ordinary differential equations, also known as dynamical systems, and show that standard kernel inference algorithms are equivalent to Kalman filter methods based on such models. In order not to cloud qualitative insights with heavy mathematical machinery, we restrict ourselves to finite domains, implying that differential equations are treated via their corresponding finite difference equations.
(179), Max Planck Institute for Biological Cybernetics, Tübingen, Germany, November 2008 (techreport)
We investigate an implicit method to compute a piecewise linear representation of a surface from a set of sample points. As implicit surface functions we use the weighted sum of piecewise linear kernel functions. For such a function we can partition R^d in such a way that these functions are linear on the subsets of the partition. For each subset in the partition we can then compute the zero level set of the function exactly as the intersection of a hyperplane with the subset.
In Computer Vision - ECCV 2008, Lecture Notes in Computer Science, Vol. 5304, pages: 126-139, (Editors: DA Forsyth and PHS Torr and A Zisserman), Springer, Berlin, Germany, 10th European Conference on Computer Vision, October 2008 (inproceedings)
We aim to automatically colorize greyscale images without any manual intervention. The color proposition could then be interactively corrected by user-provided color landmarks if necessary. Automatic colorization is nontrivial since there is usually no one-to-one correspondence between color and local texture. The contribution of our framework is that we deal directly with multimodality and estimate, for each pixel of the image to be colored, the probability distribution of all possible colors, instead of choosing the most probable color at the local level. We also predict the expected variation of color at each pixel, thus defining a nonuniform spatial coherency criterion. We then use graph cuts to maximize the probability of the whole colored image at the global level. We work in the L-a-b color space in order to approximate the human perception of distances between colors, and we use machine learning tools to extract as much information as possible from a dataset of colored examples. The resulting algorithm is fast, designed to be more robust to texture noise, and is above all able to deal with ambiguity, in contrast to previous approaches.
Journal of Nuclear Medicine, 49(11):1875-1883, October 2008 (article)
For quantitative PET information, correction of tissue photon attenuation is mandatory. Generally in conventional PET, the attenuation map is obtained from a transmission scan, which uses a rotating radionuclide source, or from the CT scan in a combined PET/CT scanner. In the case of PET/MRI scanners currently under development, insufficient space for the rotating source exists; the attenuation map can be calculated from the MR image instead. This task is challenging because MR intensities correlate with proton densities and tissue-relaxation properties, rather than with attenuation-related mass density. METHODS: We used a combination of local pattern recognition and atlas registration, which captures global variation of anatomy, to predict pseudo-CT images from a given MR image. These pseudo-CT images were then used for attenuation correction, as the process would be performed in a PET/CT scanner. RESULTS: For human brain scans, we show on a database of 17 MR/CT image pairs that our method reliably enables estimation of a pseudo-CT image from the MR image alone. On additional datasets of MRI/PET/CT triplets of human brain scans, we compare MRI-based attenuation correction with CT-based correction. Our approach enables PET quantification with a mean error of 3.2% for predefined regions of interest, which we found to be clinically not significant. However, our method is not specific to brain imaging, and we show promising initial results on 1 whole-body animal dataset. CONCLUSION: This method allows reliable MRI-based attenuation correction for human brain scans. Further work is necessary to validate the method for whole-body imaging.
In FG 2008, pages: 1-8, IEEE Computer Society, Los Alamitos, CA, USA, 8th IEEE International Conference on Automatic Face and Gesture Recognition, September 2008 (inproceedings)
This paper presents a fully automated algorithm for reconstructing a textured 3D model of a face from a single photograph or a raw video stream. The algorithm is based on a combination of Support Vector Machines (SVMs) and a Morphable Model of 3D faces. After SVM face detection, individual facial features are detected using a novel regression- and classification-based approach, and probabilistically plausible configurations of features are selected to produce a list of candidates for several facial feature positions. In the next step, the configurations of feature points are evaluated using a novel criterion that is based on a Morphable Model and a combination of linear projections. To make the algorithm robust with respect to head orientation, this process is iterated while the estimate of pose is refined. Finally, the feature points initialize a model-fitting procedure of the Morphable Model. The result is a high-resolution 3D surface model.
In Advances in Neural Information Processing Systems 20, pages: 489-496, (Editors: JC Platt and D Koller and Y Singer and S Roweis), Curran, Red Hook, NY, USA, 21st Annual Conference on Neural Information Processing Systems (NIPS), September 2008 (inproceedings)
We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.
In Advances in Neural Information Processing Systems 20, pages: 1369-1376, (Editors: JC Platt and D Koller and Y Singer and S Roweis), Curran, Red Hook, NY, USA, 21st Annual Conference on Neural Information Processing Systems (NIPS), September 2008 (inproceedings)
We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a
projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results.
Journal of Mathematical Psychology, 52(5):297-303, September 2008 (article)
Similarity is used as an explanatory construct throughout psychology, and multidimensional scaling (MDS) is the most popular way to assess similarity. In MDS, similarity is intimately connected to the idea of a geometric representation of stimuli in a perceptual space. Whilst connecting similarity and closeness of stimuli in a geometric representation may be intuitively plausible, Tversky and Gati [Tversky, A., Gati, I. (1982). Similarity, separability, and the triangle inequality. Psychological Review, 89(2), 123-154] have reported data which are inconsistent with the usual geometric representations that are based on segmental additivity. We show that similarity measures based on Shepard's universal law of generalization [Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237(4820), 1317-1323] lead to an inner product representation in a reproducing kernel Hilbert space. In such a space stimuli are represented by their similarity to all other stimuli. This representation, based on Shepard's law, has a natural metric that does not have additive segments, whilst still retaining the intuitive notion of connecting similarity and distance between stimuli. Furthermore, this representation has the psychologically appealing property that the distance between stimuli is bounded.
In Advances in Neural Information Processing Systems 20, pages: 585-592, (Editors: JC Platt and D Koller and Y Singer and S Roweis), Curran, Red Hook, NY, USA, 21st Annual Conference on Neural Information Processing Systems (NIPS), September 2008 (inproceedings)
Whereas kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2), where m is the sample size. We demonstrate that this test outperforms established contingency table-based tests. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
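A minimal sketch of the HSIC statistic named above, computed as trace(KHLH)/m^2 at the paper's stated O(m^2) cost. The paper derives a principled null distribution; the permutation threshold below is a simple stand-in, and the Gaussian kernel bandwidth is an illustrative choice.

```python
import numpy as np

def _gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / m^2 with centering matrix H
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(_gram(X, sigma) @ H @ _gram(Y, sigma) @ H) / m ** 2

def hsic_test(X, Y, n_perm=200, level=0.05, sigma=1.0, seed=0):
    # Permutation null: shuffling Y simulates independence from X
    rng = np.random.default_rng(seed)
    stat = hsic(X, Y, sigma)
    null = [hsic(X, Y[rng.permutation(len(Y))], sigma) for _ in range(n_perm)]
    return stat, stat > np.quantile(null, 1 - level)

# Toy usage: dependent data should be flagged, independent data should not.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 1))
Y = X + 0.3 * rng.standard_normal((100, 1))
print(hsic_test(X, Y))  # (statistic, reject_independence)
```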
Hinterberger, T., Widmann, G., Lal, T., Hill, J., Tangermann, M., Rosenstiel, W., Schölkopf, B., Elger, C., Birbaumer, N.
Epilepsy and Behavior, 13(2):300-306, August 2008 (article)
Brain-computer interfaces (BCIs) can be used for communication in writing without muscular activity or for learning to control seizures by voluntary regulation of brain signals such as the electroencephalogram (EEG). Three of five patients with epilepsy were able to spell their names with electrocorticogram (ECoG) signals derived from motor-related areas within only one or two training sessions. Imagery of finger or tongue movements was classified with support-vector classification of autoregressive coefficients derived from the ECoG signals. After training of the classifier, binary classification responses were used to select letters from a computer-generated menu. Offline analysis showed increased theta activity in the unsuccessful patients, whereas the successful patients exhibited dominant sensorimotor rhythms that they could control. The high spatial resolution and increased signal-to-noise ratio in ECoG signals, combined with short training periods, may offer an alternative for communication in complete paralysis, locked-in syndrome, and motor restoration.
Rätsch, G., Clark, R., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthman, N., Hu, T., Fu, G., Hinds, D., Cheng, H., Frazer, K., Huson, D., Schölkopf, B., Nordborg, M., Ecker, J., Weigel, D., Schneeberger, K., Bohlen, A.
16th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), July 2008 (talk)
Laubinger, S., Zeller, G., Henz, S., Sachsenberg, T., Widmer, C., Naouar, N., Vuylsteke, M., Schölkopf, B., Rätsch, G., Weigel, D.
Genome Biology, 9(7: R112):1-16, July 2008 (article)
Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.
In Proceedings of the 25th International Conference on Machine Learning, pages: 992-999, (Editors: WW Cohen and A McCallum and S Roweis), ACM Press, New York, NY, USA, ICML, July 2008 (inproceedings)
Moment matching is a popular means of parametric density estimation. We extend this technique to nonparametric estimation of mixture models. Our approach works by embedding distributions into a reproducing kernel Hilbert space, and performing moment matching in that space. This allows us to tailor density estimators to a function class of interest (i.e., for which we would like to compute expectations). We show our density estimation approach is useful in applications such as message compression in graphical models, and image classification and
In Proceedings of the 21st Annual Conference on Learning Theory, pages: 111-122, (Editors: RA Servedio and T Zhang), Omnipress, Madison, WI, USA, 21st Annual Conference on Learning Theory (COLT), July 2008 (inproceedings)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The embedding function has been proven to be injective when the reproducing kernel is universal. In this case, the embedding induces a metric on the space of probability distributions defined on compact metric spaces.

In the present work, we consider more broadly the problem of specifying characteristic kernels, defined as kernels for which the RKHS embedding of probability measures is injective. In particular, characteristic kernels can include non-universal kernels. We restrict ourselves to translation-invariant kernels on Euclidean space, and define the associated metric on probability measures in terms of the Fourier spectrum of the kernel and characteristic functions of these measures. The support of the kernel spectrum is important in finding whether a kernel is characteristic: in particular, the embedding is injective if and only if the kernel spectrum has the entire domain as its support. Characteristic kernels may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples, again due to the interaction of the kernel spectrum and the characteristic functions of the measures.
In Proceedings of the 25th International Conference on Machine Learning, pages: 1112-1119, (Editors: WW Cohen and A McCallum and S Roweis), ACM Press, New York, NY, USA, ICML, July 2008 (inproceedings)
Most existing sparse Gaussian process (g.p.) models seek computational advantages by basing their computations on a set of m basis functions that are the covariance function of the g.p. with one of its two inputs fixed. We generalise this for the case of the Gaussian covariance function, by basing our computations on m Gaussian basis functions with arbitrary diagonal covariance matrices (or length scales). For a fixed number of basis functions and any given criterion, this additional flexibility permits approximations no worse and typically better than was previously possible. We perform gradient-based optimisation of the marginal likelihood, which costs O(m^2 n) time where n is the number of data points, and compare the method to various other sparse g.p. methods. Although we focus on g.p. regression, the central idea is applicable to all kernel-based algorithms, and we also provide some results for the support vector machine (s.v.m.) and kernel ridge regression (k.r.r.). Our approach outperforms the other methods, particularly for the case of very few basis functions, i.e. a very high sparsity ratio.
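To make the core idea concrete, here is a crude stand-in: m Gaussian basis functions, each with its own center and length scale, fit by regularized least squares. The paper instead optimizes the sparse g.p. marginal likelihood over these parameters; this sketch only illustrates the extra flexibility that per-basis length scales provide, with all specific values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-4, 4, (300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)

m = 10
centers = np.linspace(-4, 4, m)      # one center per basis function (hypothetical)
scales = np.geomspace(0.2, 2.0, m)   # one length scale per basis function

# n x m design matrix of Gaussian basis functions with individual scales
Phi = np.exp(-(X - centers) ** 2 / (2 * scales ** 2))

# Regularized least squares in place of marginal-likelihood optimisation
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(m), Phi.T @ y)
y_hat = Phi @ w
print("train RMSE:", np.sqrt(np.mean((y_hat - y) ** 2)))
```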
Journal of Vision, 8(6):635, 8th Annual Meeting of the Vision Sciences Society (VSS), June 2008 (poster)
Humans perceive the world by directing the center of gaze from one location to another via rapid eye movements, called saccades. In the period between saccades, the direction of gaze is held fixed for a few hundred milliseconds (fixations). It is primarily during fixations that information enters the visual system. Remarkably, however, after only a few fixations we perceive a coherent, high-resolution scene, despite the visual acuity of the eye quickly decreasing away from the center of gaze: this suggests an effective strategy for selecting saccade targets.
Top-down effects, such as the observer's task, thoughts, or intentions, have an effect on saccadic selection. Equally well known is that bottom-up effects (local image structure) influence saccade targeting regardless of top-down effects. However, the question of what the most salient visual features are is still under debate. Here we model the relationship between spatial intensity patterns in natural images and the response of the saccadic system using tools from machine learning. This allows us to identify the most salient image patterns that guide the bottom-up component of the saccadic selection system, which we refer to as perceptive fields. We show that center-surround patterns emerge as the optimal solution to the problem of predicting saccade targets. Using a novel nonlinear system identification technique, we reduce our learned classifier to a one-layer feed-forward network, which is surprisingly simple compared to previously suggested models assuming more complex computations such as multi-scale processing, oriented filters and lateral inhibition. Nevertheless, our model is equally predictive and generalizes better to novel image sets. Furthermore, our findings are consistent with neurophysiological hardware in the superior colliculus. Bottom-up visual saliency may thus not be computed cortically, as has been thought previously.
Annals of Statistics, 36(3):1171-1220, June 2008 (article)
We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data.
(170), Max Planck Institute for Biological Cybernetics, Tübingen, Germany, June 2008 (techreport)
This report summarizes the theory and some main applications of a new non-monotonic algorithm for maximizing a Poisson likelihood, which for Positron Emission Tomography (PET) is equivalent to minimizing the associated Kullback-Leibler divergence, and for transmission tomography is similar to maximizing the dual of a maximum entropy problem. We call our method non-monotonic maximum likelihood (NMML) and show its application to different problems such as tomography and image restoration. We discuss some theoretical properties, such as convergence, of our algorithm. Our experimental results indicate that speedups obtained via our non-monotonic methods are substantial.
Workshop on Geometry and Statistics of Shapes, June 2008 (talk)
With the help of differential geometry we describe a framework to define a thin-plate-spline-like energy for maps between arbitrary Riemannian manifolds. The so-called Eells energy depends only on the intrinsic geometry of the input and output manifolds, not on their respective representations. The energy can then be used for regression between manifolds; we present results for cases where the outputs are rotations, sets of angles, or points on 3D surfaces. In the future we plan to also target regression where the output is an element of "shape space", understood as a Riemannian manifold. One could also further explore the meaning of the Eells energy when applied to diffeomorphisms between shapes, especially with regard to its potential use as a distance measure between shapes that does not depend on the embedding or the parametrisation of the shapes.
Psychonomic Bulletin and Review, 15(2):256-271, April 2008 (article)
Exemplar theories of categorization depend on similarity for explaining subjects' ability to generalize to new stimuli. A major criticism of exemplar theories concerns their lack of abstraction mechanisms and thus, seemingly, generalization ability. Here, we use insights from machine learning to demonstrate that exemplar models can actually generalize very well. Kernel methods in machine learning are akin to exemplar models and very successful in real-world applications. Their generalization performance depends crucially on the chosen similarity measure. While similarity plays an important role in describing generalization behavior, it is not the only factor that controls generalization performance. In machine learning, kernel methods are often combined with regularization techniques to ensure good generalization. These same techniques are easily incorporated in exemplar models. We show that the Generalized Context Model (Nosofsky, 1986) and ALCOVE (Kruschke, 1992) are closely related to a statistical model called kernel logistic regression. We argue that generalization is central to the enterprise of understanding categorization behavior and suggest how insights from machine learning can offer some guidance. Keywords: kernel, similarity, regularization, generalization, categorization.
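A minimal sketch of the kernel logistic regression model the abstract relates to exemplar theories: the decision function is a kernel expansion over stored items, fit here by plain gradient descent on the regularized log-loss. Kernel, bandwidth, learning rate, and regularization constant are illustrative choices, not values from the paper.

```python
import numpy as np

def gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_logistic_regression(X, y, lam=0.1, sigma=1.0, lr=0.1, n_iter=500):
    """Kernel logistic regression: f(x) = sum_i alpha_i k(x_i, x),
    labels y in {0, 1}, trained by gradient descent on the penalized
    negative log-likelihood."""
    K = gram(X, sigma)
    alpha = np.zeros(len(X))
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-K @ alpha))       # predicted class probabilities
        grad = K @ (p - y) + lam * K @ alpha   # gradient of loss + RKHS penalty
        alpha -= lr * grad / len(X)
    return alpha, K

# Toy usage: two overlapping classes in 2D.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
alpha, K = kernel_logistic_regression(X, y)
```

Read exemplar-theoretically, each stored item i votes for a category with weight alpha_i scaled by its similarity k(x_i, x) to the probe, which echoes the Generalized Context Model's similarity-based choice rule.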
(157), Max Planck Institute for Biological Cybernetics, Tübingen, Germany, April 2008 (techreport)
We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear-time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
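A minimal sketch of the quadratic-time statistic (the MMD) with a Gaussian kernel, plus a pooled-permutation threshold as a simple surrogate for the paper's bound-based and asymptotic tests. Bandwidth, permutation count, and test level are illustrative assumptions.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 with a Gaussian kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def mmd_test(X, Y, n_perm=200, level=0.05, sigma=1.0, seed=0):
    # Permutation null: pool both samples and re-split them at random
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    stat = mmd2_unbiased(X, Y, sigma)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null.append(mmd2_unbiased(Z[idx[:m]], Z[idx[m:]], sigma))
    return stat, stat > np.quantile(null, 1 - level)

# Toy usage: samples from shifted distributions should be told apart.
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 2))
Y = rng.standard_normal((60, 2)) + 0.5
print(mmd_test(X, Y))  # (statistic, reject_null)
```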
Computer Graphics Forum, 27(2):437-448, April 2008 (article)
We present a generalization of thin-plate splines for interpolation and approximation of manifold-valued data, and demonstrate its usefulness in computer graphics with several applications from different fields. The cornerstone of our theoretical framework is an energy functional for mappings between two Riemannian manifolds which is independent of parametrization and respects the geometry of both manifolds. If the manifolds are Euclidean, the energy functional reduces to the classical thin-plate spline energy. We show how the resulting optimization problems can be solved efficiently in many cases. Our example applications range from orientation interpolation and motion planning in animation over geometric modelling tasks to color interpolation.
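For orientation, here is the classical Euclidean special case the abstract mentions (the one the manifold-valued energy reduces to), using SciPy's radial basis function interpolator with the thin-plate-spline kernel. This is only the baseline, not the paper's manifold-valued method; the scattered sites and test function are made up.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Classical thin-plate spline interpolation of scattered 2D data.
rng = np.random.default_rng(3)
pts = rng.uniform(0, 1, (30, 2))                      # scattered sample sites
vals = np.sin(4 * pts[:, 0]) * np.cos(4 * pts[:, 1])  # values to interpolate

tps = RBFInterpolator(pts, vals, kernel="thin_plate_spline", smoothing=0.0)

# Evaluate the smooth interpolant on a regular grid.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
surface = tps(grid).reshape(50, 50)
```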
In Advances in Computational Intelligence and Learning: Proceedings of the European Symposium on Artificial Neural Networks, pages: 13-18, (Editors: M Verleysen), d-side, Evere, Belgium, 16th European Symposium on Artificial Neural Networks (ESANN), April 2008 (inproceedings)
While it is well known that models can enhance control performance in terms of precision or energy efficiency, practical application has often been limited by the complexities of manually obtaining sufficiently accurate models. In the past, learning has proven a viable alternative to using a combination of rigid-body dynamics and handcrafted approximations of nonlinearities. However, a major open question is which nonparametric learning method is best suited for learning dynamics. Traditionally, locally weighted projection regression (LWPR) has been the standard method, as it is capable of online, real-time learning for very complex robots. However, while LWPR has had significant impact on learning in robotics, alternative nonparametric regression methods such as support vector regression (SVR) and Gaussian process regression (GPR) offer interesting alternatives with fewer open parameters and potentially higher accuracy. In this paper, we evaluate these three alternatives for model learning. Our comparison consists of evaluating the learning quality of each regression method on original data from a SARCOS robot arm, as well as the robot tracking performance achieved with the learned models. The results show that GPR and SVR achieve superior learning precision and can be applied for real-time control, obtaining higher accuracy. However, for online learning, LWPR is the better method due to its lower computational requirements.
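As a small sketch of the kind of comparison described (not the paper's experiment), the snippet below fits GPR and SVR to a synthetic stand-in for inverse-dynamics data and compares test error; the data-generating function and all hyperparameters are invented for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.svm import SVR

# Toy stand-in for inverse-dynamics data (the paper uses a SARCOS arm):
# inputs mimic joint positions/velocities, target mimics one joint torque.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (400, 4))
y = X[:, 0] * X[:, 1] + np.sin(3 * X[:, 2]) + 0.05 * rng.standard_normal(400)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X_tr, y_tr)
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X_tr, y_tr)

for name, model in [("GPR", gpr), ("SVR", svr)]:
    rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
    print(f"{name} test RMSE: {rmse:.4f}")
```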
Neurocomputing, 71(7-9):1248-1256, March 2008 (article)
We propose a method to quantify the complexity of conditional probability measures by a Hilbert space seminorm of the logarithm of their densities. The concept of reproducing kernel Hilbert spaces (RKHSs) is a flexible tool to define such a seminorm by choosing an appropriate kernel. We present several examples with artificial data sets where our kernel-based complexity measure is consistent with our intuitive understanding of complexity of densities. The intention behind the complexity measure is to provide a new approach to inferring causal directions. The idea is that the factorization of the joint probability measure P(effect, cause) into P(effect|cause)P(cause) typically leads to "simpler" and "smoother" terms than the factorization into P(cause|effect)P(effect). Since the conventional constraint-based approach of causal discovery is not able to determine the causal direction between only two variables, our inference principle can in particular be useful when combined with other existing methods. We provide several simple examples with real-world data where the true causal directions indeed lead to simpler (conditional) densities.
Macke, J., Maack, N., Gupta, R., Denk, W., Schölkopf, B., Borst, A.
Journal of Neuroscience Methods, 167(2):349-357, January 2008 (article)
A new technique, Serial Block Face Scanning Electron Microscopy (SBFSEM), allows for automatic sectioning and imaging of biological tissue with a scanning electron microscope. Image stacks generated with this technology have a resolution sufficient to distinguish different cellular compartments, including synaptic structures, which should make it possible to obtain detailed anatomical knowledge of complete neuronal circuits. Such an image stack contains several thousand images and is recorded with a minimal voxel size of 10-20 nm in the x- and y-directions and 30 nm in the z-direction. Consequently, a tissue block of 1 mm^3 (the approximate volume of the Calliphora vicina brain) will produce several hundred terabytes of data. Therefore, highly automated 3D reconstruction algorithms are needed. As a first step in this direction, we have developed semi-automated segmentation algorithms for precise contour tracing of cell membranes. These algorithms were embedded into an easy-to-operate user interface, which allows direct 3D observation of the extracted objects during the segmentation of image stacks. Compared to purely manual tracing, processing time is greatly accelerated.
(167), Max Planck Institute for Biological Cybernetics, Tübingen, January 2008 (techreport)
This technical report is merely an extended version of the appendix of Steinke et al., "Manifold-valued Thin-Plate Splines with Applications in Computer Graphics" (2008), with complete proofs, which had to be omitted due to space restrictions. The technical report requires a basic knowledge of differential geometry; apart from that requirement, it is self-contained.
Journal of Mathematical Psychology, 51(6):343-358, December 2007 (article)
The abilities to learn and to categorize are fundamental for cognitive systems, be it animals or machines, and therefore have attracted attention from engineers and psychologists alike. Modern machine learning methods and psychological models of categorization are remarkably similar, partly because these two fields share a common history in artificial neural networks and reinforcement learning. However, machine learning is now an independent and mature field that has moved beyond psychologically or neurally inspired algorithms towards providing foundations for a theory of learning that is rooted in statistics and functional analysis. Much of this research is potentially interesting for psychological theories of learning and categorization but also hardly accessible for psychologists. Here, we provide a tutorial introduction to a popular class of machine learning tools, called kernel methods. These methods are closely related to perceptrons, radial-basis-function neural networks and exemplar theories of categorization. Recent theoretical advances in machine learning are closely tied to the idea that the similarity of patterns can be encapsulated in a positive definite kernel. Such a positive definite kernel can define a reproducing kernel Hilbert space, which allows one to use powerful tools from functional analysis for the analysis of learning algorithms. We give basic explanations of some key concepts (the so-called kernel trick, the representer theorem and regularization) which may open up the possibility that insights from machine learning can feed back into psychology.
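The kernel trick mentioned above can be demonstrated in a few lines: a kernel evaluates an inner product of feature vectors without ever constructing them explicitly. The example below verifies this numerically for the inhomogeneous polynomial kernel of degree 2 in two dimensions (a standard textbook case, chosen here for illustration).

```python
import numpy as np

# For k(x, z) = (x.z + 1)^2 on R^2, the implicit feature map is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1).
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1], 1.0])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z + 1) ** 2)   # kernel evaluation: 4.0
print(phi(x) @ phi(z))    # identical inner product in feature space: 4.0
```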
2007 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS-MIC 2007), 2007(M16-6):1-2, November 2007 (poster)
PET/MR combines the high soft tissue contrast of Magnetic Resonance Imaging (MRI) and the functional information of Positron Emission Tomography (PET). For quantitative PET information, correction of tissue photon attenuation is mandatory. Usually in conventional PET, the attenuation map is obtained from a transmission scan, which uses a rotating source, or from the CT scan in the case of combined PET/CT. In the case of a PET/MR scanner, there is insufficient space for the rotating source, and ideally one would want to calculate the attenuation map from the MR image instead. Since MR images provide information about the proton density of the different tissue types, it is not trivial to use this data for PET attenuation correction. We present a method for predicting the PET attenuation map from a given MR image, using a combination of atlas registration and recognition of local patterns.
Using "leave one out cross validation" we show on a database of 16 MR-CT image pairs that our method reliably allows estimating the CT image from the MR image. Subsequently, as in PET/CT, the PET attenuation map can be predicted from the CT image. On an additional dataset of MR/CT/PET triplets we quantitatively validate that our approach allows PET quantification with an error that is smaller than what would be clinically significant.
We demonstrate our approach on T1-weighted human brain scans. However, the presented methods are more general and current research focuses on applying the established methods to human whole body PET/MRI applications.
Sonnenburg, S., Braun, M., Ong, C., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K., Pereira, F., Rasmussen, C., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., Williamson, R.
Journal of Machine Learning Research, 8, pages: 2443-2466, October 2007 (article)
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not realized, since existing implementations are not openly shared, resulting in software with low usability and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
Proceedings of the 10th International Conference on Discovery Science (DS 2007), 10, pages: 40-41, October 2007 (poster)
While kernel methods are the basis of many popular techniques in supervised learning, they are less commonly used in testing, estimation, and analysis of probability distributions, where information theoretic approaches rule the roost. However, it becomes difficult to estimate mutual information or entropy if the data are high-dimensional.
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.