In Pattern Recognition: 39th German Conference, GCPR 2017, Basel, Switzerland, September 12–15, 2017, Proceedings, pages: 52-62, Springer International Publishing, Cham, 2017 (inbook)
Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.
In Special Issue on Generative Models in Computer Vision and Medical Imaging, 136, pages: 32-44, Elsevier, July 2015 (inproceedings)
Computer vision is hard because of a large variability in lighting, shape, and texture; in addition the image signal is non-additive due to occlusion. Generative models promised to account for this variability by accurately modelling the image formation process as a function of latent variables with prior beliefs. Bayesian posterior inference could then, in principle, explain the observation. While intuitively appealing, generative models for computer vision have largely failed to deliver on that promise due to the difficulty of posterior inference. As a result the community has favored efficient discriminative approaches. We still believe in the usefulness of generative models in computer vision, but argue that we need to leverage existing discriminative or even heuristic computer vision methods.
We implement this idea in a principled way in our informed sampler and in careful experiments demonstrate it on challenging models which contain renderer programs as their components. The informed sampler, using simple discriminative proposals based on existing computer vision technology achieves dramatic improvements in inference. Our approach enables a new richness in generative models that was out of reach with existing inference technology.
Advanced Structured Prediction, pages: 432, Neural Information Processing Series, MIT Press, November 2014 (book)
The goal of structured prediction is to build machine learning models that predict relational information that itself has structure, such as being composed of multiple interrelated parts. These models, which reflect prior knowledge, task-specific relations, and constraints, are used in fields including computer vision, speech recognition, natural language processing, and computational biology. They can carry out such tasks as predicting a natural language sentence, or segmenting an image into meaningful components.
These models are expressive and powerful, but exact computation is often intractable. A broad research effort in recent years has aimed at designing structured prediction models and approximate inference and learning procedures that are computationally efficient. This volume offers an overview of this recent research in order to make the work accessible to a broader research community. The chapters, by leading researchers in the field, cover a range of topics, including research trends, the linear programming relaxation approach, innovations in probabilistic modeling, recent theoretical progress, and resource-aware learning.
In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 1314-1321, IEEE, IEEE International Conference on Computer Vision and Pattern Recognition, June 2014 (inproceedings)
Dynamic Bayesian networks such as Hidden Markov
Models (HMMs) are successfully used as probabilistic models
for human motion. The use of hidden variables makes
them expressive models, but inference is only approximate
and requires procedures such as particle filters or Markov
chain Monte Carlo methods. In this work we propose to instead
use simple Markov models that only model observed
quantities. We retain a highly expressive dynamic model by
using interactions that are nonlinear and non-parametric.
A presentation of our approach in terms of latent variables
shows logarithmic growth for the computation of exact loglikelihoods
in the number of latent states. We validate
our model on human motion capture data and demonstrate
state-of-the-art performance on action recognition and motion
In Proceedings IEEE Conf. on Computer Vision (ICCV), pages: 1281-1288, IEEE International Conference on Computer Vision, December 2013 (inproceedings)
Having a sensible prior of human pose is a vital ingredient for many computer vision applications, including tracking and pose estimation. While the application of global non-parametric approaches and parametric models has led to some success, finding the right balance in terms of flexibility and tractability, as well as estimating model parameters from data has turned out to be challenging. In this work, we introduce a sparse Bayesian network model of human pose that is non-parametric with respect to the estimation of both its graph structure and its local distributions. We describe an efficient sampling scheme for our model and show its tractability for the computation of exact log-likelihoods. We empirically validate our approach on the Human 3.6M dataset and demonstrate superior performance to global models and parametric networks. We further illustrate our model's ability to represent and compose poses not present in the training set (compositionality) and describe a speed-accuracy trade-off that allows realtime scoring of poses.
In Proceedings of 34th DAGM Symposium, pages: 397-407, Lecture Notes in Computer Science, (Editors: Pinz, Axel and Pock, Thomas and Bischof, Horst and Leberl, Franz), Springer, August 2012 (inproceedings)
pages: 494, Neural information processing series, MIT Press, Cambridge, MA, USA, December 2011 (book)
The interplay between optimization and machine learning is one of the most important developments in modern computational science. Optimization formulations and methods are proving to be vital in designing algorithms to extract essential knowledge from huge volumes of data. Machine learning, however, is not simply a consumer of optimization technology but a rapidly evolving field that is itself generating new optimization ideas. This book captures the state of the art of the interaction between optimization and machine learning in a way that is accessible to researchers in both fields.
Optimization approaches have enjoyed prominence in machine learning because of their wide applicability and attractive theoretical properties. The increasing complexity, size, and variety of today's machine learning models call for the reassessment of existing assumptions. This book starts the process of reassessment. It describes the resurgence in novel contexts of established frameworks such as first-order methods, stochastic approximations, convex relaxations, interior-point methods, and proximal methods. It also devotes attention to newer themes such as regularized optimization, robust optimization, gradient and subgradient methods, splitting techniques, and second-order methods. Many of these techniques draw inspiration from other fields, including operations research, theoretical computer science, and subfields of optimization. The book will enrich the ongoing cross-fertilization between the machine learning community and these other fields, and within the broader optimization community.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages: 2836-2843, IEEE Service Center, Piscataway, NJ, USA, CVPR, June 2009 (inproceedings)
Most modern computer vision systems for high-level tasks, such as image classification, object recognition and segmentation, are based on learning algorithms that are able to separate discriminative information from noise. In practice, however, the typical system consists of a long pipeline of pre-processing steps, such as extraction of different kinds of features, various kinds of normalizations, feature selection, and quantization into aggregated representations such as histograms. Along this pipeline, there are many parameters to set and choices to make, and their effect on the overall system performance is a-priori unclear. In this work, we shorten the pipeline in a principled way. We move pre-processing steps into the learning system by means of kernel parameters, letting the learning algorithm decide upon suitable parameter values. Learning to optimize the pre-processing choices becomes learning the kernel parameters. We realize this paradigm by extending the recent Multiple Kernel Learning formulation from the finite case of having a fixed number of kernels which can be combined to the general infinite case where each possible parameter setting induces an associated kernel. We evaluate the new paradigm extensively on image classification and object classification tasks. We show that it is possible to learn optimal discriminative codebooks and optimal spatial pyramid schemes, consistently outperforming all previous state-of-the-art approaches.
In CVPR 2009, pages: 818-825, IEEE Service Center, Piscataway, NJ, USA, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009 (inproceedings)
Markov random field (MRF, CRF) models are popular in
computer vision. However, in order to be computationally
tractable they are limited to incorporate only local interactions
and cannot model global properties, such as connectedness,
which is a potentially useful high-level prior
for object segmentation. In this work, we overcome this
limitation by deriving a potential function that enforces the
output labeling to be connected and that can naturally be
used in the framework of recent MAP-MRF LP relaxations.
Using techniques from polyhedral combinatorics, we show
that a provably tight approximation to the MAP solution of
the resulting MRF can still be found efficiently by solving
a sequence of max-flow problems. The efficiency of the inference
procedure also allows us to learn the parameters
of a MRF with global connectivity potentials by means of a
cutting plane algorithm. We experimentally evaluate our algorithm
on both synthetic data and on the challenging segmentation
task of the PASCAL VOC 2008 data set. We show
that in both cases the addition of a connectedness prior significantly
reduces the segmentation error.
In ICML 2009, pages: 769-776, (Editors: Danyluk, A. , L. Bottou, M. Littman), ACM Press, New York, NY, USA, 26th International Conference on Machine Learning, June 2009 (inproceedings)
We propose a new method to quantify the solution
stability of a large class of combinatorial
optimization problems arising in machine
learning. As practical example we apply the
method to correlation clustering, clustering
aggregation, modularity clustering, and relative
performance significance clustering. Our
method is extensively motivated by the idea
of linear programming relaxations. We prove
that when a relaxation is used to solve the
original clustering problem, then the solution
stability calculated by our method is conservative,
that is, it never overestimates the solution
stability of the true, unrelaxed problem.
We also demonstrate how our method
can be used to compute the entire path of
optimal solutions as the optimization problem
is increasingly perturbed. Experimentally,
our method is shown to perform well
on a number of benchmark problems.
In ICML 2008, pages: 704-711, (Editors: Cohen, W. W., A. McCallum, S. Roweis), ACM Press, New York, NY, USA, 25th International Conference on Machine Learning, July 2008 (inproceedings)
A recent trend in exemplar based unsupervised
learning is to formulate the learning
problem as a convex optimization problem.
Convexity is achieved by restricting the set
of possible prototypes to training exemplars.
In particular, this has been done for clustering,
vector quantization and mixture model
density estimation. In this paper we propose
a novel algorithm that is theoretically and
practically superior to these convex formulations.
This is possible by posing the unsupervised
learning problem as a single convex
master problem" with non-convex subproblems.
We show that for the above learning
tasks the subproblems are extremely wellbehaved
and can be solved efficiently.
(180), Max-Planck Institute for Biological Cybernetics, Tübingen, Germany, November 2008 (techreport)
Discovery of knowledge from geometric graph databases is of particular importance in chemistry and
biology, because chemical compounds and proteins are represented as graphs with 3D geometric coordinates. In
such applications, scientists are not interested in the statistics of the whole database. Instead they need information
about a novel drug candidate or protein at hand, represented as a query graph. We propose a polynomial-delay
algorithm for geometric frequent subgraph retrieval. It enumerates all subgraphs of a single given query graph
which are frequent geometric epsilon-subgraphs under the entire class of rigid geometric transformations in a database.
By using geometric epsilon-subgraphs, we achieve tolerance against variations in geometry. We compare the proposed
algorithm to gSpan on chemical compound data, and we show that for a given minimum support the total number
of frequent patterns is substantially limited by requiring geometric matching. Although the computation time per
pattern is larger than for non-geometric graph mining, the total time is within a reasonable level even for small
Machine Learning, 75(1):69-89, November 2008 (article)
Graph mining methods enumerate frequently appearing subgraph patterns, which can be used as features for subsequent classification or regression. However, frequent patterns are not necessarily informative for the given learning problem. We propose a mathematical programming boosting method (gBoost) that progressively collects informative patterns. Compared to AdaBoost, gBoost can build the prediction rule with fewer iterations. To apply the boosting method to graph data, a branch-and-bound pattern search algorithm is developed based on the DFS code tree. The constructed search space is reused in later iterations to minimize the computation time. Our method can learn more efficiently than the simpler method based on frequent substructure mining, because the output labels are used as an extra information source for pruning the search space. Furthermore, by engineering the mathematical program, a wide range of machine learning problems can be solved without modifying the pattern search algorithm.
In ICDM 2008, pages: 953-958, (Editors: Giannotti, F. , D. Gunopulos, F. Turini, C. Zaniolo, N. Ramakrishnan, X. Wu), IEEE Computer Society, Los Alamitos, CA, USA, 8th IEEE International Conference on Data Mining, December 2008 (inproceedings)
Discovery of knowledge from geometric graph databases is of particular importance in chemistry and biology, because chemical compounds and proteins are represented as graphs with 3D geometric coordinates. In such applications, scientists are not interested in the statistics of the whole database. Instead they need information about a novel drug candidate or protein at hand, represented as a query graph. We propose a polynomial-delay algorithm for geometric frequent subgraph retrieval. It enumerates all subgraphs of a single given query graph which are frequent geometric $epsilon$-subgraphs under the entire class of rigid geometric transformations in a database. By using geometric$epsilon$-subgraphs, we achieve tolerance against variations in geometry. We compare the proposed algorithm to gSpan on chemical compound data, and we show that for a given minimum support the total number of frequent patterns is substantially limited by requiring geometric matching. Although the computation time per pattern is lar
ger than for non-geometric graph mining,the total time is within a reasonable level even for small minimum support.
(174), Max-Planck-Institute for Biological Cybernetics, Tübingen, Germany, August 2008 (techreport)
We study the question of activity classification in videos and present a novel approach for recognizing
human action categories in videos by combining information from appearance and motion of human body parts.
Our approach uses a tracking step which involves Particle Filtering and a local non - parametric clustering step.
The motion information is provided by the trajectory of the cluster modes of a local set of particles. The statistical
information about the particles of that cluster over a number of frames provides the appearance information. Later
we use a Bag ofWords model to build one histogram per video sequence from the set of these robust appearance
and motion descriptors. These histograms provide us characteristic information which helps us to discriminate
among various human actions and thus classify them correctly.
We tested our approach on the standard KTH and Weizmann human action datasets and the results were comparable
to the state of the art. Additionally our approach is able to distinguish between activities that involve the
motion of complete body from those in which only certain body parts move. In other words, our method discriminates
well between activities with gross motion like running, jogging etc. and local motion like waving,
In Proceedings of the NIPS 2008 Workshop on "Kernel Learning: Automatic Selection of Optimal Kernels", pages: 1-4, NIPS Workshop on "Kernel Learning: Automatic Selection of Optimal Kernels" (LK ASOK´08), December 2008 (inproceedings)
In this paper we build upon the Multiple Kernel Learning (MKL) framework
and in particular on  which generalized it to infinitely many
kernels. We rewrite the problem in the standard MKL formulation which
leads to a Semi-Infinite Program. We devise a new algorithm to solve it
(Infinite Kernel Learning, IKL). The IKL algorithm is applicable to both
the finite and infinite case and we find it to be faster and more stable
than SimpleMKL . Furthermore we present the first large scale
comparison of SVMs to MKL on a variety of benchmark datasets, also
comparing IKL. The results show two things: a) for many datasets there
is no benefit in using MKL/IKL instead of the SVM classifier, thus the
flexibility of using more than one kernel seems to be of no use, b) on
some datasets IKL yields massive increases in accuracy over SVM/MKL due
to the possibility of using a largely increased kernel set. For those
cases parameter selection through Cross-Validation or MKL is not applicable.
Gehler, P., Nowozin, S.Infinite Kernel Learning
In Proceedings of the NIPS 2008 Workshop on "Kernel Learning: Automatic Selection of Optimal Kernels", pages: 1-4, NIPS Workshop on "Kernel Learning: Automatic Selection of Optimal Kernels" (LK ASOK´08), December 2008 (inproceedings)
(178), Max-Planck Institute for Biological Cybernetics, Tübingen, Germany, October 2008 (techreport)
In this paper we consider the problem of automatically learning the kernel from general kernel
classes. Specifically we build upon the Multiple Kernel Learning (MKL) framework and in particular on the work
of (Argyriou, Hauser, Micchelli, & Pontil, 2006). We will formulate a Semi-Infinite Program (SIP) to solve the
problem and devise a new algorithm to solve it (Infinite Kernel Learning, IKL). The IKL algorithm is applicable
to both the finite and infinite case and we find it to be faster and more stable than SimpleMKL (Rakotomamonjy,
Bach, Canu, & Grandvalet, 2007) for cases of many kernels. In the second part we present the first large scale
comparison of SVMs to MKL on a variety of benchmark datasets, also comparing IKL. The results show two
things: a) for many datasets there is no benefit in linearly combining kernels with MKL/IKL instead of the SVM
classifier, thus the flexibility of using more than one kernel seems to be of no use, b) on some datasets IKL yields
impressive increases in accuracy over SVM/MKL due to the possibility of using a largely increased kernel set. In
those cases, IKL remains practical, whereas both cross-validation or standard MKL is infeasible.
In ICCV 2007, pages: 1919-1923, IEEE Computer Society, Los Alamitos, CA, USA, 11th IEEE International Conference on Computer Vision, October 2007 (inproceedings)
Recent approaches to action classification in videos have
used sparse spatio-temporal words encoding local appearance
around interesting movements. Most of these approaches
use a histogram representation, discarding the
temporal order among features. But this ordering information
can contain important information about the action
itself, e.g. consider the sport disciplines of hurdle race
and long jump, where the global temporal order of motions
(running, jumping) is important to discriminate between
the two. In this work we propose to use a sequential
representation which retains this temporal order. Further,
we introduce Discriminative Subsequence Mining to find
optimal discriminative subsequence patterns. In combination
with the LPBoost classifier, this amounts to simultaneously
learning a classification function and performing feature
selection in the space of all possible feature sequences.
The resulting classifier linearly combines a small number
of interpretable decision functions, each checking for the
presence of a single discriminative pattern. The classifier is
benchmarked on the KTH action classification data set and
outperforms the best known results in the literature.
In CVPR 2007, pages: 1-8, IEEE Computer Society, Los Alamitos, CA, USA, 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2007 (inproceedings)
In web-related applications of image categorization, it is
desirable to derive an interpretable classification rule with
high accuracy. Using the bag-of-words representation and
the linear support vector machine, one can partly fulfill the
goal, but the accuracy of linear classifiers is not high and
the obtained features are not informative for users. We propose
to combine item set mining and large margin classifiers
to select features from the power set of all visual words.
Our resulting classification rule is easier to browse and simpler
to understand, because each feature has richer information.
As a next step, each image is represented as a graph
where nodes correspond to local image features and edges
encode geometric relations between features. Combining
graph mining and boosting, we can obtain a classification
rule based on subgraph features that contain more information
than the set features. We evaluate our algorithm
in a web-retrieval ranking task where the goal is to reject
outliers from a set of images returned for a keyword
query. Furthermore, it is evaluated on the supervised classification
tasks with the challenging VOC2005 data set.
Our approach yields excellent accuracy in the unsupervised
ranking task compared to a recently proposed probabilistic
model and competitive results in the supervised classification task.
Biologische Kybernetik, Technical University of Berlin, Berlin, Germany, May 2006 (diplomathesis)
Object classification in digital images remains one of the most challenging tasks in computer
vision. Advances in the last decade have produced methods to repeatably extract and describe
characteristic local features in natural images. In order to apply machine learning techniques
in computer vision systems, a representation based on these features is needed.
A set of local features is the most popular representation and often used in conjunction
with Support Vector Machines for classification problems. In this work, we examine current
approaches based on set representations and identify their shortcomings.
To overcome these shortcomings, we argue for extending the set representation into a
graph representation, encoding more relevant information. Attributes associated with the
edges of the graph encode the geometric relationships between individual features by making
use of the meta data of each feature, such as the position, scale, orientation and shape of the
feature region. At the same time all invariances provided by the original feature extraction
method are retained.
To validate the novel approach, we use a standard subset of the ETH-80 classification
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments and to use this understanding to design future systems