Machine learning algorithms are designed to yield accurate predictions on future data by generalizing from past observations. The main goal of learning theory is to analyze their statistical and computational properties, and to provide performance guarantees on the new data. To do so, it poses these tasks in a rigorous mathematical framework and deals with them under various assumptions on the data-generating process.
PAC-Bayesian analysis provides guarantees on the generalization ability of randomized predictors within the classical PAC (Probably Approximately Correct) framework. We derived general PAC-Bayesian inequalities that control the concentration of weighted averages of possibly uncountably many simultaneously evolving and interdependent martingales [ ]. We applied these inequalities to contextual bandits, obtaining a new regret bound and learning algorithm [ ].
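For orientation, a standard PAC-Bayesian bound from the literature (stated here in its classical kl-form for i.i.d. data and bounded loss; the martingale-based results above generalize this setting, and the precise inequality of the cited work differs) reads:

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all posterior distributions \rho over hypotheses h:
\operatorname{kl}\!\left(\mathbb{E}_{h\sim\rho}\hat{L}(h)\,\middle\|\,\mathbb{E}_{h\sim\rho}L(h)\right)
\;\le\; \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{2\sqrt{n}}{\delta}}{n}
```

Here \(\hat{L}\) and \(L\) denote empirical and expected risk, \(\pi\) is a fixed prior over hypotheses, and \(\operatorname{kl}\) is the binary relative entropy; the martingale extensions replace the i.i.d. sample averages with evolving martingale sequences.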
We have studied the relations and differences between the Vapnik-Chervonenkis (VC) and Popper's dimensions. While both aim to quantify the process of falsification, the former focuses on passive falsification based on examples provided by nature, whereas the latter focuses on active falsification through experiment design. We have shown that concepts related to Popper's notion of falsifiability occur in the domain of query learning, and derived relations between Popper's dimension, the exclusion dimension, and the VC dimension [ ].
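As a small illustration of the VC dimension mentioned above (not of the cited results themselves), the class of intervals on the real line shatters some two-point sets but no three-point set, so its VC dimension is 2. A brute-force check:

```python
# Illustrative only: verify shattering for the class of intervals [a, b].
def interval_labels(a, b, points):
    """Labels induced on `points` by the indicator of the interval [a, b]."""
    return tuple(1 if a <= x <= b else 0 for x in points)

def shattered(points):
    """Check whether intervals realize all 2^n labelings of `points`."""
    # Candidate endpoints slightly around the points suffice to enumerate
    # every labeling the class can produce on this finite set.
    grid = sorted({c + d for c in points for d in (-0.5, 0.0, 0.5)})
    realized = {interval_labels(a, b, points)
                for a in grid for b in grid if a <= b}
    return len(realized) == 2 ** len(points)

print(shattered([1.0, 2.0]))       # two points can be shattered
print(shattered([1.0, 2.0, 3.0]))  # labeling (1, 0, 1) is unrealizable
```

The failure on three points is exactly the pattern (1, 0, 1): no interval contains the outer points while excluding the middle one.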
We have analyzed how the performance of learning algorithms adapts to local structure in the data, which often helps when the ambient dimension is large. We established that certain regression procedures, such as kernel and k-NN regression, adapt to a locally low intrinsic dimension [ ]. Further, we have shown how to improve regression performance by adapting the learning process to differing variability in various coordinates of the function to be estimated [ ]. For this, we proposed a novel, consistent estimator for gradient norms. We also developed new data structures for regression in a streaming setting [ ].
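A minimal sketch of the phenomenon (plain k-NN regression, not the cited procedures): when the data lie on a low-dimensional structure inside a high-dimensional ambient space, neighbors are found along that structure, so accuracy is governed by the intrinsic dimension.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=5):
    """Plain k-nearest-neighbor regression: average the labels of the
    k training points closest to the query x (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    return y_train[idx].mean()

# Data on a 1-dimensional curve embedded in a 10-dimensional ambient space:
# the effective difficulty is driven by the intrinsic dimension (here 1).
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=200)
X = np.outer(t, np.ones(10))                    # points along a line in R^10
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=200)
x_query = np.full(10, 0.25)
print(knn_regress(X, y, x_query))               # should be close to sin(pi/2) = 1
```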
We showed the first statistical consistency result (under surprisingly general distributional conditions) for additive noise methods in causal inference [ ]. Further, we posed the problem of cause-effect inference as binary classification of probability measures, proposed a computationally efficient algorithm based on kernel mean embeddings achieving state-of-the-art performance, and provided a risk bound for the approach [ ].
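The key representational step of such a pipeline can be sketched as follows (a hypothetical simplification, not the cited algorithm): each cause-effect dataset is mapped to a fixed-length feature vector via an empirical kernel mean embedding evaluated at landmark points, after which any standard binary classifier can be trained on these vectors.

```python
import numpy as np

def rbf_mean_embedding(sample, landmarks, gamma=1.0):
    """Empirical kernel mean embedding of a sample, evaluated at fixed
    landmarks: mu(z) = (1/n) * sum_i exp(-gamma * ||x_i - z||^2)."""
    d2 = ((sample[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).mean(axis=0)

# One "dataset" of 100 (X, Y) pairs becomes a single 20-dimensional vector;
# classification of probability measures then reduces to ordinary
# classification of these vectors.
rng = np.random.default_rng(1)
landmarks = rng.normal(size=(20, 2))
dataset = rng.normal(size=(100, 2))
features = rbf_mean_embedding(dataset, landmarks)
print(features.shape)                      # (20,) -- one vector per dataset
```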
In transductive learning the learner observes labeled training and unlabeled test points with the final goal of correctly labeling the test points. We have introduced a new complexity measure of function classes called Permutational Rademacher Complexity, argued that it is better suited for analysis of transductive learning than other popular measures of complexity, and derived a general risk bound for transductive algorithms based on it [ ].
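For context, the inductive notion that Permutational Rademacher Complexity modifies can be estimated numerically. Below is a Monte Carlo estimate of the standard empirical Rademacher complexity of a finite function class (PRC itself, which is tailored to the transductive setting, is not reproduced here):

```python
import numpy as np

def empirical_rademacher(F, n_draws=2000, seed=0):
    """Monte Carlo estimate of empirical Rademacher complexity:
    E_sigma sup_f (1/n) * sum_i sigma_i * f(x_i),
    where F is an (m, n) array of m functions evaluated on n fixed points."""
    rng = np.random.default_rng(seed)
    n = F.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # random sign vectors
    # For each sign draw, take the best correlation over the class, then average.
    return (sigma @ F.T / n).max(axis=1).mean()

# A small class: threshold functions on 5 points, f_j(x_i) = 1 if i >= j else -1.
n = 5
F = np.array([[1.0 if i >= j else -1.0 for i in range(n)] for j in range(n + 1)])
print(empirical_rademacher(F))
```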
Active learning exploits structure and information in unlabeled data to save label supervision. The goal is to ask for only a small subset of the data to be labeled while maintaining good performance. We provided novel statistical guarantees for a large class of practical cluster-based active learning procedures [ ]. We further showed that an algorithmic technique from margin-based active learning yields a computationally efficient learner for linear separators under conditions that had earlier been shown to allow for fast statistical learning rates (Massart noise condition) [ ].
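The cluster-based idea can be illustrated with a toy procedure (a generic sketch under strong cluster-structure assumptions, not one of the analyzed algorithms): cluster the unlabeled data, query the oracle only at one representative per cluster, and propagate each answer to the whole cluster.

```python
import numpy as np

def cluster_query_propagate(X, oracle, k=3, iters=20, seed=0):
    """Toy cluster-based active learning: run k-means (Lloyd iterations),
    query the label oracle only at the point nearest each centroid, and
    propagate that label to the cluster -- k queries instead of len(X)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    labels = np.empty(len(X), dtype=int)
    for j in range(k):
        members = np.where(assign == j)[0]
        rep = members[np.argmin(((X[members] - centers[j]) ** 2).sum(-1))]
        labels[members] = oracle(rep)          # one label query per cluster
    return labels

# Two well-separated blobs; the oracle labels a point by its blob.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
truth = np.array([0] * 50 + [1] * 50)
pred = cluster_query_propagate(X, oracle=lambda i: truth[i], k=2)
print((pred == truth).mean())                  # fraction of correct labels
```

With only two label queries the toy procedure labels all 100 points correctly here; the statistical question the cited work addresses is when such propagation is safe.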
We also provided a first formal analysis of the use of active learning for domain adaptation, that is, to use a learner's active querying ability to adapt to changes in the data generation. We proposed a new active learning strategy and proved that it consistently learns in situations where no passive learning algorithm is consistent, while automatically adapting its amount of label queries to what is needed due to the underlying data shift [ ].