2006


Uniform Convergence of Adaptive Graph-Based Regularization

Hein, M.

In COLT 2006, pages: 50-64, (Editors: Lugosi, G. , H.-U. Simon), Springer, Berlin, Germany, 19th Annual Conference on Learning Theory, September 2006 (inproceedings)

Abstract
The regularization functional induced by the graph Laplacian of a random neighborhood graph based on the data is adaptive in two ways. First it adapts to an underlying manifold structure and second to the density of the data-generating probability measure. We identify in this paper the limit of the regularizer and show uniform convergence over the space of Hoelder functions. As an intermediate step we derive upper bounds on the covering numbers of Hoelder functions on compact Riemannian manifolds, which are of independent interest for the theoretical analysis of manifold-based learning methods.

PDF Web DOI [BibTex]

Efficient Large Scale Linear Programming Support Vector Machines

Sra, S.

In ECML 2006, pages: 767-774, (Editors: Fürnkranz, J. , T. Scheffer, M. Spiliopoulou), Springer, Berlin, Germany, 17th European Conference on Machine Learning, September 2006 (inproceedings)

Abstract
This paper presents a decomposition method for efficiently constructing ℓ1-norm Support Vector Machines (SVMs). The decomposition algorithm introduced in this paper possesses many desirable properties. For example, it is provably convergent, scales well to large datasets, is easy to implement, and can be extended to handle support vector regression and other SVM variants. We demonstrate the efficiency of our algorithm by training on (dense) synthetic datasets of sizes up to 20 million points (in ℝ32). The results show our algorithm to be several orders of magnitude faster than a previously published method for the same task. We also present experimental results on real data sets—our method is seen to be not only very fast, but also highly competitive against the leading SVM implementations.

Web DOI [BibTex]

Regularised CSP for Sensor Selection in BCI

Farquhar, J., Hill, N., Lal, T., Schölkopf, B.

In Proceedings of the 3rd International Brain-Computer Interface Workshop and Training Course 2006, pages: 14-15, (Editors: GR Müller-Putz and C Brunner and R Leeb and R Scherer and A Schlögl and S Wriessnegger and G Pfurtscheller), Verlag der Technischen Universität Graz, Graz, Austria, 3rd International Brain-Computer Interface Workshop and Training Course, September 2006 (inproceedings)

Abstract
The Common Spatial Pattern (CSP) algorithm is a highly successful method for efficiently calculating spatial filters for brain signal classification. Spatial filtering can improve classification performance considerably, but demands that a large number of electrodes be mounted, which is inconvenient in day-to-day BCI usage. The CSP algorithm is also known for its tendency to overfit, i.e. to learn the noise in the training set rather than the signal. Both problems motivate an approach in which spatial filters are sparsified. We briefly sketch a reformulation of the problem which allows us to do this, using 1-norm regularisation. Focusing on the electrode selection issue, we present preliminary results on EEG data sets that suggest that effective spatial filters may be computed with as few as 10-20 electrodes, hence offering the potential to simplify the practical realisation of BCI systems significantly.

PDF PDF [BibTex]

Time-Dependent Demixing of Task-Relevant EEG Signals

Hill, N., Farquhar, J., Lal, T., Schölkopf, B.

In Proceedings of the 3rd International Brain-Computer Interface Workshop and Training Course 2006, pages: 20-21, (Editors: GR Müller-Putz and C Brunner and R Leeb and R Scherer and A Schlögl and S Wriessnegger and G Pfurtscheller), Verlag der Technischen Universität Graz, Graz, Austria, 3rd International Brain-Computer Interface Workshop and Training Course, September 2006 (inproceedings)

Abstract
Given a spatial filtering algorithm that has allowed us to identify task-relevant EEG sources, we present a simple approach for monitoring the activity of these sources while remaining relatively robust to changes in other (task-irrelevant) brain activity. The idea is to keep spatial *patterns* fixed rather than spatial filters, when transferring from training to test sessions or from one time window to another. We show that a fixed spatial pattern (FSP) approach, using a moving-window estimate of signal covariances, can be more robust to non-stationarity than a fixed spatial filter (FSF) approach.

PDF PDF [BibTex]

Inferential Structure Determination: Probabilistic determination and validation of NMR structures

Habeck, M.

Gordon Research Conference on Computational Aspects of Biomolecular NMR, September 2006 (talk)

Web [BibTex]

Transductive Gaussian Process Regression with Automatic Model Selection

Le, Q., Smola, A., Gärtner, T., Altun, Y.

In Machine Learning: ECML 2006, pages: 306-317, (Editors: Fürnkranz, J. , T. Scheffer, M. Spiliopoulou), Springer, Berlin, Germany, 17th European Conference on Machine Learning (ECML), September 2006 (inproceedings)

Abstract
In contrast to the standard inductive inference setting of predictive machine learning, in real world learning problems often the test instances are already available at training time. Transductive inference tries to improve the predictive accuracy of learning algorithms by making use of the information contained in these test instances. Although this description of transductive inference applies to predictive learning problems in general, most transductive approaches consider the case of classification only. In this paper we introduce a transductive variant of Gaussian process regression with automatic model selection, based on approximate moment matching between training and test data. Empirical results show the feasibility and competitiveness of this approach.

Web DOI [BibTex]

A Sober Look at Clustering Stability

Ben-David, S., von Luxburg, U., Pal, D.

In COLT 2006, pages: 5-19, (Editors: Lugosi, G. , H.-U. Simon), Springer, Berlin, Germany, 19th Annual Conference on Learning Theory, September 2006 (inproceedings)

Abstract
Stability is a common tool to verify the validity of sample based algorithms. In clustering it is widely used to tune the parameters of the algorithm, such as the number k of clusters. In spite of the popularity of stability in practical applications, there has been very little theoretical analysis of this notion. In this paper we provide a formal definition of stability and analyze some of its basic properties. Quite surprisingly, the conclusion of our analysis is that for large sample size, stability is fully determined by the behavior of the objective function which the clustering algorithm is aiming to minimize. If the objective function has a unique global minimizer, the algorithm is stable, otherwise it is unstable. In particular we conclude that stability is not a well-suited tool to determine the number of clusters - it is determined by the symmetries of the data which may be unrelated to clustering parameters. We prove our results for center-based clusterings and for spectral clustering, and support our conclusions by many examples in which the behavior of stability is counter-intuitive.

PDF Web DOI [BibTex]

Information Marginalization on Subgraphs

Huang, J., Zhu, T., Greiner, R., Zhou, D., Schuurmans, D.

In ECML/PKDD 2006, pages: 199-210, (Editors: Fürnkranz, J. , T. Scheffer, M. Spiliopoulou), Springer, Berlin, Germany, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, September 2006 (inproceedings)

Abstract
Real-world data often involves objects that exhibit multiple relationships; for example, ‘papers’ and ‘authors’ exhibit both paper-author interactions and paper-paper citation relationships. A typical learning problem requires one to make inferences about a subclass of objects (e.g. ‘papers’), while using the remaining objects and relations to provide relevant information. We present a simple, unified mechanism for incorporating information from multiple object types and relations when learning on a targeted subset. In this scheme, all sources of relevant information are marginalized onto the target subclass via random walks. We show that marginalized random walks can be used as a general technique for combining multiple sources of information in relational data. With this approach, we formulate new algorithms for transduction and ranking in relational data, and quantify the performance of new schemes on real world data—achieving good results in many problems.

Web DOI [BibTex]

Bayesian Active Learning for Sensitivity Analysis

Pfingsten, T.

In ECML 2006, pages: 353-364, (Editors: Fürnkranz, J. , T. Scheffer, M. Spiliopoulou), Springer, Berlin, Germany, 17th European Conference on Machine Learning, September 2006 (inproceedings)

Abstract
Designs of micro electro-mechanical devices need to be robust against fluctuations in mass production. Computer experiments with tens of parameters are used to explore the behavior of the system, and to compute sensitivity measures as expectations over the input distribution. Monte Carlo methods are a simple approach to estimate these integrals, but they are infeasible when the models are computationally expensive. Using a Gaussian processes prior, expensive simulation runs can be saved. This Bayesian quadrature allows for an active selection of inputs where the simulation promises to be most valuable, and the number of simulation runs can be reduced further. We present an active learning scheme for sensitivity analysis which is rigorously derived from the corresponding Bayesian expected loss. On three fully featured, high dimensional physical models of electro-mechanical sensors, we show that the learning rate in the active learning scheme is significantly better than for passive learning.

PDF Web DOI [BibTex]

An Online-Computation Approach to Optimal Finite-Horizon State-Feedback Control of Nonlinear Stochastic Systems

Deisenroth, MP.

Biologische Kybernetik, Universität Karlsruhe (TH), Karlsruhe, Germany, August 2006 (diplomathesis)

PDF [BibTex]

From outliers to prototypes: Ordering data

Harmeling, S., Dornhege, G., Tax, D., Meinecke, F., Müller, K.

Neurocomputing, 69(13-15):1608-1618, August 2006 (article)

Abstract
We propose simple and fast methods based on nearest neighbors that order objects from high-dimensional data sets from typical points to untypical points. On the one hand, we show that these easy-to-compute orderings allow us to detect outliers (i.e. very untypical points) with a performance comparable to or better than other often much more sophisticated methods. On the other hand, we show how to use these orderings to detect prototypes (very typical points) which facilitate exploratory data analysis algorithms such as noisy nonlinear dimensionality reduction and clustering. Comprehensive experiments demonstrate the validity of our approach.

PDF PDF DOI [BibTex]

An Online Support Vector Machine for Abnormal Events Detection

Davy, M., Desobry, F., Gretton, A., Doncarli, C.

Signal Processing, 86(8):2009-2025, August 2006 (article)

Abstract
The ability to detect online abnormal events in signals is essential in many real-world Signal Processing applications. Previous algorithms require an explicit statistical model of the signal, and interpret abnormal events as abrupt changes in that model. The corresponding implementations rely on maximum likelihood or on Bayes estimation theory, with generally excellent performance. However, there are numerous cases where a robust and tractable model cannot be obtained, and model-free approaches need to be considered. In this paper, we investigate a machine learning, descriptor-based approach that does not require an explicit statistical model of the descriptors, based on Support Vector novelty detection. A sequential optimization algorithm is introduced. Theoretical considerations as well as simulations on real signals demonstrate its practical efficiency.

PDF PostScript PDF DOI [BibTex]

A tutorial on spectral clustering

von Luxburg, U.

(149), Max Planck Institute for Biological Cybernetics, Tübingen, August 2006 (techreport)

Abstract
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. Nevertheless, at first glance spectral clustering looks a bit mysterious, and it is not obvious why it works at all and what it really does. This article is a tutorial introduction to spectral clustering. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.

PDF [BibTex]
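
To make the graph Laplacian mentioned in the abstract above concrete, here is a minimal sketch for the two-cluster case. It is not taken from the tutorial: the Gaussian similarity, the parameter `sigma`, the function name `spectral_bipartition`, and the use of NumPy are assumptions made for illustration. It also swaps the k-means step of the usual algorithms for simple sign thresholding of the second eigenvector, which is only sensible for two clusters on a connected graph.

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    """Split points into two groups via the unnormalized graph Laplacian L = D - W."""
    X = np.asarray(points, dtype=float)
    # Fully connected similarity graph with Gaussian weights (an assumed choice).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order; column 1 belongs to the
    # second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]
    # Sign thresholding replaces k-means here (illustrative simplification).
    return (fiedler > 0).astype(int)

# Two groups of points on a line; sigma is chosen so the graph stays connected.
labels = spectral_bipartition([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]], sigma=3.0)
```

Which group receives label 0 and which receives label 1 depends on the sign of the eigenvector returned by the solver, so only the grouping itself is meaningful.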

Machine Learning Algorithms for Polymorphism Detection

Schweikert, G., Zeller, G., Clark, R., Ossowski, S., Warthmann, N., Shinn, P., Frazer, K., Ecker, J., Huson, D., Weigel, D., Schölkopf, B., Rätsch, G.

2nd ISCB Student Council Symposium, August 2006 (talk)

Abstract
Analyzing resequencing array data using machine learning, we obtain a genome-wide inventory of polymorphisms in 20 wild strains of Arabidopsis thaliana, including 750,000 single nucleotide polymorphisms (SNPs) and thousands of highly polymorphic regions and deletions. We thus provide an unprecedented resource for the study of natural variation in plants.

Web [BibTex]

Integrating Structured Biological data by Kernel Maximum Mean Discrepancy

Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H., Schölkopf, B., Smola, A.

Bioinformatics, 22(4: ISMB 2006 Conference Proceedings):e49-e57, August 2006 (article)

Abstract
Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology. Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors. Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.

Web DOI [BibTex]
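
The statistic described in the abstract above can be sketched in a few lines. This is an illustrative estimate only, not the authors' code: it uses the standard biased kernel estimate of the squared MMD, and the Gaussian RBF kernel, the parameter `gamma`, and the names `rbf_kernel` and `mmd2_biased` are assumptions. The sketch computes the statistic but does not include the thresholding step that turns it into a test.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel between two equal-length numeric sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2_biased(xs, ys, kernel=rbf_kernel):
    """Biased estimate of the squared Maximum Mean Discrepancy between two samples."""
    m, n = len(xs), len(ys)
    # Mean kernel value within each sample and across the two samples.
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical toy samples: the second is a shifted copy of the first.
sample_a = [(0.0,), (0.1,), (0.2,)]
sample_b = [(3.0,), (3.1,), (3.2,)]
print(mmd2_biased(sample_a, sample_a), mmd2_biased(sample_a, sample_b))
```

Because only kernel evaluations are needed, the same function would apply to strings or graphs if `kernel` were replaced by a suitable kernel on those objects.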

Towards the Inference of Graphs on Ordered Vertexes

Zien, A., Raetsch, G., Ong, C.

(150), Max Planck Institute for Biological Cybernetics, Tübingen, August 2006 (techreport)

Abstract
We propose novel methods for machine learning of structured output spaces. Specifically, we consider outputs which are graphs with vertices that have a natural order. We consider the usual adjacency matrix representation of graphs, as well as two other representations for such a graph: (a) decomposing the graph into a set of paths, (b) converting the graph into a single sequence of nodes with labeled edges. For each of the three representations, we propose an encoding and decoding scheme. We also propose an evaluation measure for comparing two graphs.

PDF [BibTex]

Semi-supervised Hyperspectral Image Classification with Graphs

Bandos, T., Zhou, D., Camps-Valls, G.

In IGARSS 2006, pages: 3883-3886, IEEE Computer Society, Los Alamitos, CA, USA, IEEE International Conference on Geoscience and Remote Sensing, August 2006 (inproceedings)

Abstract
This paper presents a semi-supervised graph-based method for the classification of hyperspectral images. The method is designed to exploit the spatial/contextual information in the images through composite kernels. The proposed method produces smoother classifications with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. Compared to standard inductive support vector machines, it achieves good accuracy in high-dimensional spaces with a low number of labeled samples (ill-posed situations).

PDF Web DOI [BibTex]

Large Scale Transductive SVMs

Collobert, R., Sinz, F., Weston, J., Bottou, L.

Journal of Machine Learning Research, 7, pages: 1687-1712, August 2006 (article)

Abstract
We show how the Concave-Convex Procedure can be applied to the optimization of Transductive SVMs, which traditionally requires solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach.

PostScript PDF PDF [BibTex]

Supervised Probabilistic Principal Component Analysis

Yu, S., Yu, K., Tresp, V., Kriegel, H., Wu, M.

In KDD 2006, pages: 464-473, (Editors: Ungar, L. ), ACM Press, New York, NY, USA, 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006 (inproceedings)

Abstract
Principal component analysis (PCA) has been extensively applied in data mining, pattern recognition and information retrieval for unsupervised dimensionality reduction. When labels of data are available, e.g., in a classification or regression task, PCA is however not able to use this information. The problem is more interesting if only part of the input data are labeled, i.e., in a semi-supervised setting. In this paper we propose a supervised PCA model called SPPCA and a semi-supervised PCA model called S$^2$PPCA, both of which are extensions of a probabilistic PCA model. The proposed models are able to incorporate the label information into the projection phase, and can naturally handle multiple outputs (i.e., in multi-task learning problems). We derive an efficient EM learning algorithm for both models, and also provide theoretical justifications of the model behaviors. SPPCA and S$^2$PPCA are compared with other supervised projection methods on various learning tasks, and show not only promising performance but also good scalability.

PDF Web DOI [BibTex]

Pattern detection methods and systems and face detection methods and systems

Blake, A., Romdhani, S., Schölkopf, B., Torr, P. H. S.

United States Patent, No 7099504, August 2006 (patent)

[BibTex]

Building Support Vector Machines with Reduced Classifier Complexity

Keerthi, S., Chapelle, O., DeCoste, D.

Journal of Machine Learning Research, 7, pages: 1493-1515, July 2006 (article)

Abstract
Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size ($d_{\max}$) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as $O(n d_{\max}^2)$ where $n$ is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.

PDF [BibTex]

Inferential structure determination: Overview and new developments

Habeck, M.

Sixth CCPN Annual Conference: Efficient and Rapid Structure Determination by NMR, July 2006 (talk)

Web [BibTex]

Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger

Ciaramita, M., Altun, Y.

In pages: 594-602, (Editors: Jurafsky, D. , E. Gaussier), Association for Computational Linguistics, Stroudsburg, PA, USA, 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), July 2006 (inproceedings)

Abstract
In this paper we approach word sense disambiguation and information extraction as a unified tagging problem. The task consists of annotating text with the tagset defined by the 41 Wordnet supersense classes for nouns and verbs. Since the tagset is directly related to Wordnet synsets, the tagger returns partial word sense disambiguation. Furthermore, since the noun tags include the standard named entity detection classes – person, location, organization, time, etc. – the tagger, as a by-product, returns extended named entity information. We cast the problem of supersense tagging as a sequential labeling task and investigate it empirically with a discriminatively-trained Hidden Markov Model. Experimental evaluation on the main sense-annotated datasets available, i.e., Semcor and Senseval, shows considerable improvements over the best known “first-sense” baseline.

Web [BibTex]

ARTS: Accurate Recognition of Transcription Starts in Human

Sonnenburg, S., Zien, A., Rätsch, G.

Bioinformatics, 22(14):e472-e480, July 2006 (article)

Abstract
Motivation: Among the most important features of genomic DNA are the protein-coding genes. While it is of great value to identify those genes and the encoded proteins, it is also crucial to understand how their transcription is regulated. To this end one has to identify the corresponding promoters and the contained transcription factor binding sites. TSS finders can be used to locate potential promoters. They may also be used in combination with other signal and content detectors to resolve entire gene structures. Results: We have developed a novel kernel based method - called ARTS - that accurately recognizes transcription start sites in human. The application of otherwise too computationally expensive Support Vector Machines was made possible due to the use of efficient training and evaluation techniques using suffix tries. In a carefully designed experimental study, we compare our TSS finder to state-of-the-art methods from the literature: McPromoter, Eponine and FirstEF. For given false positive rates within a reasonable range, we consistently achieve considerably higher true positive rates. For instance, ARTS finds about 24% true positives at a false positive rate of 1/1000, where the other methods find less than half (10.5%). Availability: Datasets, model selection results, whole genome predictions, and additional experimental results are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/arts

Web DOI [BibTex]

MR/PET Attenuation Correction

Hofmann, M., Schölkopf, B., Steinke, F., Pichler, B.

Max-Planck-Gesellschaft, Biologische Kybernetik, July 2006 (patent)

[BibTex]

Large Scale Multiple Kernel Learning

Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.

Journal of Machine Learning Research, 7, pages: 1531-1565, July 2006 (article)

Abstract
While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundreds of thousands of examples or hundreds of kernels to be combined, and helps with automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed-up mechanisms for SVMs, especially when used with sparse feature maps as they appear for string kernels, allowing us to train a string kernel SVM on a real-world splice data set of 10 million examples from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.

PDF [BibTex]

Factorial coding of natural images: how effective are linear models in removing higher-order dependencies?

Bethge, M.

Journal of the Optical Society of America A, 23(6):1253-1268, June 2006 (article)

Abstract
The performance of unsupervised learning models for natural images is evaluated quantitatively by means of information theory. We estimate the gain in statistical independence (the multi-information reduction) achieved with independent component analysis (ICA), principal component analysis (PCA), zero-phase whitening, and predictive coding. Predictive coding is translated into the transform coding framework, where it can be characterized by the constraint of a triangular filter matrix. A randomly sampled whitening basis and the Haar wavelet are included in the comparison as well. The comparison of all these methods is carried out for different patch sizes, ranging from 2x2 to 16x16 pixels. In spite of large differences in the shape of the basis functions, we find only small differences in the multi-information between all decorrelation transforms (5% or less) for all patch sizes. Among the second-order methods, PCA is optimal for small patch sizes and predictive coding performs best for large patch sizes. The extra gain achieved with ICA is always less than 2%. In conclusion, the ‘edge filters’ found with ICA lead only to a surprisingly small improvement in terms of its actual objective.

PDF Web [BibTex]


A Continuation Method for Semi-Supervised SVMs

Chapelle, O., Chi, M., Zien, A.

In ICML 2006, pages: 185-192, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
Semi-Supervised Support Vector Machines (S3VMs) are an appealing method for using unlabeled data in classification: their objective function favors decision boundaries which do not cut clusters. However their main problem is that the optimization problem is non-convex and has many local minima, which often results in suboptimal performances. In this paper we propose to use a global optimization technique known as continuation to alleviate this problem. Compared to other algorithms minimizing the same objective function, our continuation method often leads to lower test errors.

PDF Web DOI [BibTex]

Classification of natural scenes: Critical features revisited

Drewes, J., Wichmann, F., Gegenfurtner, K.

Journal of Vision, 6(6):561, 6th Annual Meeting of the Vision Sciences Society (VSS), June 2006 (poster)

Abstract
Human observers are capable of detecting animals within novel natural scenes with remarkable speed and accuracy. Despite the seeming complexity of such decisions it has been hypothesized that a simple global image feature, the relative abundance of high spatial frequencies at certain orientations, could underlie such fast image classification (A. Torralba & A. Oliva, Network: Comput. Neural Syst., 2003). We successfully used linear discriminant analysis to classify a set of 11,000 images into “animal” and “non-animal” images based on their individual amplitude spectra only (Drewes, Wichmann, Gegenfurtner VSS 2005). We proceeded to sort the images based on the performance of our classifier, retaining only the best and worst classified 400 images (“best animals”, “best distractors” and “worst animals”, “worst distractors”). We used a Go/No-go paradigm to evaluate human performance on this subset of our images. Both reaction time and proportion of correctly classified images showed a significant effect of classification difficulty. Images more easily classified by our algorithm were also classified faster and better by humans, as predicted by the Torralba & Oliva hypothesis. We then equated the amplitude spectra of the 400 images, which, by design, reduced algorithmic performance to chance whereas human performance was only slightly reduced (cf. Wichmann, Rosas, Gegenfurtner, VSS 2005). Most importantly, the same images as before were still classified better and faster, suggesting that even in the original condition features other than specifics of the amplitude spectrum made particular images easy to classify, clearly at odds with the Torralba & Oliva hypothesis.

Web DOI [BibTex]

MCMC inference in (Conditionally) Conjugate Dirichlet Process Gaussian Mixture Models

Rasmussen, C., Görür, D.

ICML Workshop on Learning with Nonparametric Bayesian Methods, June 2006 (talk)

Abstract
We compare the predictive accuracy of the Dirichlet Process Gaussian mixture models using conjugate and conditionally conjugate priors and show that better density models result from using the wider class of priors. We explore several MCMC schemes exploiting conditional conjugacy and show their computational merits on several multidimensional density estimation problems.

Web [BibTex]

Unifying Divergence Minimization and Statistical Inference Via Convex Duality

Altun, Y., Smola, A.

In Learning Theory, pages: 139-153, (Editors: Lugosi, G. , H.-U. Simon), Springer, Berlin, Germany, 19th Annual Conference on Learning Theory (COLT), June 2006 (inproceedings)

Abstract
In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation as a special case. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.

Web DOI [BibTex]

Trading Convexity for Scalability

Collobert, R., Sinz, F., Weston, J., Bottou, L.

In ICML 2006, pages: 201-208, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
Convex learning algorithms, such as Support Vector Machines (SVMs), are often seen as highly desirable because they offer strong practical properties and are amenable to theoretical analysis. However, in this work we show how non-convexity can provide scalability advantages over convexity. We show how concave-convex programming can be applied to produce (i) faster SVMs where training errors are no longer support vectors, and (ii) much faster Transductive SVMs.

PDF Web DOI [BibTex]

Personalized handwriting recognition via biased regularization

Kienzle, W., Chellapilla, K.

In ICML 2006, pages: 457-464, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
We present a new approach to personalized handwriting recognition. The problem, also known as writer adaptation, consists of converting a generic (user-independent) recognizer into a personalized (user-dependent) one, which has an improved recognition rate for a particular user. The adaptation step usually involves user-specific samples, which leads to the fundamental question of how to fuse this new information with that captured by the generic recognizer. We propose adapting the recognizer by minimizing a regularized risk functional (a modified SVM) where the prior knowledge from the generic recognizer enters through a modified regularization term. The result is a simple personalization framework with very good practical properties. Experiments on a 100 class real-world data set show that the number of errors can be reduced by over 40% with as few as five user samples per character.

PDF Web DOI [BibTex]

Sampling for non-conjugate infinite latent feature models

Görür, D., Rasmussen, C.

(Editors: Bernardo, J. M.), 8th Valencia International Meeting on Bayesian Statistics (ISBA), June 2006 (talk)

Abstract
Latent variable models are powerful tools to model the underlying structure in data. Infinite latent variable models can be defined using Bayesian nonparametrics. Dirichlet process (DP) models constitute an example of infinite latent class models in which each object is assumed to belong to one of infinitely many mutually exclusive classes. Recently, the Indian buffet process (IBP) has been defined as an extension of the DP. IBP is a distribution over sparse binary matrices with infinitely many columns which can be used as a distribution for non-exclusive features. Inference using Markov chain Monte Carlo (MCMC) in conjugate IBP models has been previously described; however, requiring conjugacy restricts the use of IBP. We describe an MCMC algorithm for non-conjugate IBP models. Modelling choice behaviour is an important topic in psychology, economics and related fields. Elimination by Aspects (EBA) is a choice model that assumes each alternative has latent features with associated weights that lead to the observed choice outcomes. We formulate a non-parametric version of EBA by using IBP as the prior over the latent binary features. We infer the features of objects that lead to the choice data by using our sampling scheme for inference.

PDF [BibTex]

Deterministic annealing for semi-supervised kernel machines

Sindhwani, V., Keerthi, S., Chapelle, O.

In ICML 2006, pages: 841-848, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
An intuitive approach to utilizing unlabeled data in kernel-based classification algorithms is to simply treat the unknown labels as additional optimization variables. For margin-based loss functions, one can view this approach as attempting to learn low-density separators. However, this is a hard optimization problem to solve in typical semi-supervised settings where unlabeled data is abundant. The popular Transductive SVM algorithm is a label-switching-retraining procedure that is known to be susceptible to local minima. In this paper, we present a global optimization framework for semi-supervised Kernel machines where an easier problem is parametrically deformed to the original hard problem and minimizers are smoothly tracked. Our approach is motivated from deterministic annealing techniques and involves a sequence of convex optimization problems that are exactly and efficiently solved. We present empirical results on several synthetic and real world datasets that demonstrate the effectiveness of our approach.

PDF Web DOI [BibTex]

Clustering Graphs by Weighted Substructure Mining

Tsuda, K., Kudo, T.

In ICML 2006, pages: 953-960, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
Graph data is getting increasingly popular in, e.g., bioinformatics and text processing. A main difficulty of graph data processing lies in the intrinsic high dimensionality of graphs, namely, when a graph is represented as a binary feature vector of indicators of all possible subgraphs, the dimensionality gets too large for usual statistical methods. We propose an efficient method for learning a binomial mixture model in this feature space. Combining the $\ell_1$ regularizer and the data structure called DFS code tree, the MAP estimate of non-zero parameters is computed efficiently by means of the EM algorithm. Our method is applied to the clustering of RNA graphs, and compares favorably with graph kernels and the spectral graph distance.

PDF Web DOI [BibTex]

A Choice Model with Infinitely Many Latent Features

Görür, D., Jäkel, F., Rasmussen, C.

In ICML 2006, pages: 361-368, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
Elimination by aspects (EBA) is a probabilistic choice model describing how humans decide between several options. The options from which the choice is made are characterized by binary features and associated weights. For instance, when choosing which mobile phone to buy, the features to consider may be: long-lasting battery, color screen, etc. Existing methods for inferring the parameters of the model assume pre-specified features. However, the features that lead to the observed choices are not always known. Here, we present a non-parametric Bayesian model to infer the features of the options and the corresponding weights from choice data. We use the Indian buffet process (IBP) as a prior over the features. Inference using Markov chain Monte Carlo (MCMC) in conjugate IBP models has been previously described. The main contribution of this paper is an MCMC algorithm for the EBA model that can also be used in inference for other non-conjugate IBP models; this may broaden the use of IBP priors considerably.

PostScript PDF Web DOI [BibTex]

Learning High-Order MRF Priors of Color Images

McAuley, J., Caetano, T., Smola, A., Franz, MO.

In ICML 2006, pages: 617-624, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
In this paper, we use large neighborhood Markov random fields to learn rich prior models of color images. Our approach extends the monochromatic Fields of Experts model (Roth and Black, 2005) to color images. In the Fields of Experts model, the curse of dimensionality due to very large clique sizes is circumvented by parameterizing the potential functions according to a product of experts. We introduce several simplifications of the original approach by Roth and Black which allow us to cope with the increased clique size (typically 3x3x3 or 5x5x3 pixels) of color images. Experimental results are presented for image denoising which evidence improvements over state-of-the-art monochromatic image priors.

PDF Web DOI [BibTex]

Inference with the Universum

Weston, J., Collobert, R., Sinz, F., Bottou, L., Vapnik, V.

In ICML 2006, pages: 1009-1016, (Editors: Cohen, W. W., A. Moore), ACM Press, New York, NY, USA, 23rd International Conference on Machine Learning, June 2006 (inproceedings)

Abstract
In this paper we study a new framework introduced by Vapnik (1998) and Vapnik (2006) that is an alternative capacity concept to the large margin approach. In the particular case of binary classification, we are given a set of labeled examples, and a collection of "non-examples" that do not belong to either class of interest. This collection, called the Universum, allows one to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. We describe an algorithm to leverage the Universum by maximizing the number of observed contradictions, and show experimentally that this approach delivers accuracy improvements over using labeled data alone.

PDF Web DOI [BibTex]

The pedestal effect is caused by off-frequency looking, not nonlinear transduction or contrast gain-control

Wichmann, F., Henning, B.

Journal of Vision, 6(6):194, 6th Annual Meeting of the Vision Sciences Society (VSS), June 2006 (poster)

Abstract
The pedestal or dipper effect is the large improvement in the detectability of a sinusoidal grating observed when the signal is added to a pedestal or masking grating having the signal's spatial frequency, orientation, and phase. The effect is largest with pedestal contrasts just above the 'threshold' in the absence of a pedestal. We measured the pedestal effect in both broadband and notched masking noise, i.e., noise from which a 1.5-octave band centered on the signal and pedestal frequency had been removed. The pedestal effect persists in broadband noise, but almost disappears with notched noise. The spatial-frequency components of the notched noise that lie above and below the spatial frequency of the signal and pedestal prevent the use of information about changes in contrast carried in channels tuned to spatial frequencies that are very much different from that of the signal and pedestal. We conclude that the pedestal effect in the absence of notched noise results principally from the use of information derived from channels with peak sensitivities at spatial frequencies that are different from that of the signal and pedestal. Thus the pedestal or dipper effect is not a characteristic of individual spatial-frequency tuned channels.

Web DOI [BibTex]

Classifying EEG and ECoG Signals without Subject Training for Fast BCI Implementation: Comparison of Non-Paralysed and Completely Paralysed Subjects

Hill, N., Lal, T., Schröder, M., Hinterberger, T., Wilhelm, B., Nijboer, F., Mochty, U., Widman, G., Elger, C., Schölkopf, B., Kübler, A., Birbaumer, N.

IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):183-186, June 2006 (article)

Abstract
We summarize results from a series of related studies that aim to develop a motor-imagery-based brain-computer interface using a single recording session of EEG or ECoG signals for each subject. We apply the same experimental and analytical methods to 11 non-paralysed subjects (8 EEG, 3 ECoG), and to 5 paralysed subjects (4 EEG, 1 ECoG) who had been unable to communicate for some time. While it was relatively easy to obtain classifiable signals quickly from most of the non-paralysed subjects, it proved impossible to classify the signals obtained from the paralysed patients by the same methods. This highlights the fact that though certain BCI paradigms may work well with healthy subjects, this does not necessarily indicate success with the target user group. We outline possible reasons for this failure to transfer.

PDF PDF DOI [BibTex]

Object Classification using Local Image Features

Nowozin, S.

Biologische Kybernetik, Technical University of Berlin, Berlin, Germany, May 2006 (diplomathesis)

Abstract
Object classification in digital images remains one of the most challenging tasks in computer vision. Advances in the last decade have produced methods to repeatably extract and describe characteristic local features in natural images. In order to apply machine learning techniques in computer vision systems, a representation based on these features is needed. A set of local features is the most popular representation and often used in conjunction with Support Vector Machines for classification problems. In this work, we examine current approaches based on set representations and identify their shortcomings. To overcome these shortcomings, we argue for extending the set representation into a graph representation, encoding more relevant information. Attributes associated with the edges of the graph encode the geometric relationships between individual features by making use of the meta data of each feature, such as the position, scale, orientation and shape of the feature region. At the same time all invariances provided by the original feature extraction method are retained. To validate the novel approach, we use a standard subset of the ETH-80 classification benchmark.

PDF [BibTex]

SCARNA: Fast and Accurate Structural Alignment of RNA Sequences by Matching Fixed-Length Stem Fragments

Tabei, Y., Tsuda, K., Kin, T., Asai, K.

Bioinformatics, 22(14):1723-1729, May 2006 (article)

Abstract
The functions of non-coding RNAs are strongly related to their secondary structures, but it is known that a secondary structure prediction of a single sequence is not reliable. Therefore, we have to collect similar RNA sequences with a common secondary structure for the analyses of a new non-coding RNA without knowing the exact secondary structure itself. Therefore, the sequence comparison in searching similar RNAs should consider not only their sequence similarities but also their potential secondary structures. Sankoff's algorithm predicts the common secondary structures of the sequences, but it is computationally too expensive to apply to large-scale analyses. Because we often want to compare a large number of cDNA sequences or to search similar RNAs in the whole genome sequences, much faster algorithms are required. We propose a new method of comparing RNA sequences based on the structural alignments of the fixed-length fragments of the stem candidates. The implemented software, SCARNA (Stem Candidate Aligner for RNAs), is fast enough to apply to the long sequences in the large-scale analyses. The accuracy of the alignments is better than or comparable to that of the much slower existing algorithms.

PDF Web DOI [BibTex]


Statistical Convergence of Kernel CCA

Fukumizu, K., Bach, F., Gretton, A.

In Advances in neural information processing systems 18, pages: 387-394, (Editors: Weiss, Y. , B. Schölkopf, J. Platt), MIT Press, Cambridge, MA, USA, Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), May 2006 (inproceedings)

Abstract
While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence.

PDF Web [BibTex]

Response Modeling with Support Vector Machines

Shin, H., Cho, S.

Expert Systems with Applications, 30(4):746-760, May 2006 (article)

Abstract
The Support Vector Machine (SVM) employs the Structural Risk Minimization (SRM) principle to generalize better than conventional machine learning methods employing the traditional Empirical Risk Minimization (ERM) principle. When applying SVM to response modeling in direct marketing, however, one has to deal with several practical difficulties: large training data, class imbalance and binary SVM output. This paper proposes ways to alleviate or solve these difficulties through informative sampling, use of different costs for different classes, and use of distance to decision boundary. This paper also provides various evaluation measures for response models in terms of accuracies, lift chart analysis and computational efficiency.

PDF Web DOI [BibTex]

Maximum Margin Semi-Supervised Learning for Structured Variables

Altun, Y., McAllester, D., Belkin, M.

In Advances in neural information processing systems 18, pages: 33-40, (Editors: Weiss, Y. , B. Schölkopf, J. Platt), MIT Press, Cambridge, MA, USA, Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), May 2006 (inproceedings)

Abstract
Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points.

PDF Web [BibTex]

Generalized Nonnegative Matrix Approximations with Bregman Divergences

Dhillon, I., Sra, S.

In Advances in neural information processing systems 18, pages: 283-290, (Editors: Weiss, Y. , B. Schölkopf, J. Platt), MIT Press, Cambridge, MA, USA, Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), May 2006 (inproceedings)

Abstract
Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts-based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its low-rank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of "link" functions for modeling nonlinear relationships are also discussed.

PDF Web [BibTex]

Fast Gaussian Process Regression using KD-Trees

Shen, Y., Ng, A., Seeger, M.

In Advances in neural information processing systems 18, pages: 1225-1232, (Editors: Weiss, Y. , B. Schölkopf, J. Platt), MIT Press, Cambridge, MA, USA, Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), May 2006 (inproceedings)

Abstract
The computation required for Gaussian process regression with n training examples is about O(n^3) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.

PDF Web [BibTex]

Products of "Edge-perts"

Gehler, PV., Welling, M.

In Advances in neural information processing systems 18, pages: 419-426, (Editors: Weiss, Y. , B. Schölkopf, J. Platt), MIT Press, Cambridge, MA, USA, Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), May 2006 (inproceedings)

Abstract
Images represent an important and abundant source of data. Understanding their statistical structure has important applications such as image compression and restoration. In this paper we propose a particular kind of probabilistic model, dubbed the “products of edge-perts” model, to describe the structure of wavelet transformed images. We develop a practical denoising algorithm based on a single edge-pert and show state-of-the-art denoising performance on benchmark images.

PDF Web [BibTex]
