Jonas Peters (Group Leader),
Philipp Geiger,
Biwei Huang,
Niklas Pfister,
Eleni Sgouritsa,
Naji Shajari,
Carl Johann Simon-Gabriel,
Dominik Janzing,
Bernhard Schölkopf (Director)

The detection of statistical dependences is a core problem of statistics and machine learning. It plays a crucial role in association studies, and it generalizes the prediction tasks commonly studied in machine learning. In recent years, machine learning methods have enabled rather accurate prediction, often based on large training sets, for complex nonlinear problems that not long ago would have appeared completely random. However, in many situations we would prefer a causal model to a purely predictive one, i.e., a model that might tell us that a specific variable (say, whether or not a person smokes) is not just statistically associated with a disease but actually causes it. Often we are also interested in quantifying the strength of a causal influence, which is challenging even from a conceptual perspective [ ].

In order to formulate and tackle those causal problems, we use the language of structural equation models (SEMs). In an SEM, each variable is modeled as a deterministic function of its direct causes (sometimes called "parents") and a noise variable $N$, e.g. $Y=f(X,Z,N)$; all noise variables are assumed to be jointly independent. SEMs not only allow us to model observational distributions; they also let us model what happens under interventions, i.e., when some of the variables are actively set to specific values (e.g. gene knockouts or randomized studies).
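To illustrate the difference between observing and intervening, the following Python sketch simulates a toy SEM. The graph ($Z \to X \to Y$ and $Z \to Y$), the linear functions, and all coefficients are assumptions chosen purely for illustration.

```python
import numpy as np

def sample_sem(n, rng, do_x=None):
    """Draw n samples from a toy SEM with graph Z -> X -> Y and Z -> Y.

    If do_x is given, the structural equation of X is replaced by X := do_x
    (an intervention); all other equations stay untouched.
    All coefficients are illustrative assumptions.
    """
    n_z, n_x, n_y = rng.normal(size=(3, n))  # jointly independent noise
    z = n_z
    x = 2.0 * z + n_x if do_x is None else np.full(n, float(do_x))
    y = 1.5 * x - 3.0 * z + n_y
    return x, y, z

rng = np.random.default_rng(0)

# Observational data: the regression slope of Y on X is confounded by Z.
x_obs, y_obs, _ = sample_sem(100_000, rng)
slope_obs = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs, ddof=1)

# Interventional data: contrast E[Y | do(X=1)] - E[Y | do(X=0)].
_, y_do1, _ = sample_sem(100_000, rng, do_x=1.0)
_, y_do0, _ = sample_sem(100_000, rng, do_x=0.0)
effect_do = y_do1.mean() - y_do0.mean()

print(f"observational slope: {slope_obs:.2f}, interventional effect: {effect_do:.2f}")
```

In this toy model the population regression slope is 0.3, whereas the effect of the intervention equals the structural coefficient 1.5. A purely predictive model and a causal model can therefore give quite different answers.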

In recent years, we have developed conditions under which the structure of the SEM (i.e. the causal graph) is identifiable from the available data. These conditions include restricted structural equation models, independence of cause and mechanism, and invariance principles. All of them can identify the correct graph even within the Markov equivalence class. We also design corresponding estimation procedures and analyze their statistical properties. Recently, we have started to use causal ideas to improve machine learning methods.

In causal discovery (or structure learning), we try to learn the causal structure from available data (observational and/or interventional). Using the concept of SEMs, it becomes apparent that without further assumptions this goal is impossible to achieve: if we are given only an observational distribution $P$, for example, then for any graph $G$ such that $P$ is Markov w.r.t. $G$ we can find an SEM with graph $G$ that models $P$. That is, the true causal structure is not identifiable from $P$. In our group we investigate different assumptions that make the graph identifiable from the observational distribution. Additionally, we develop corresponding algorithms that exploit this identifiability and estimate the graph structure from finitely many data points. While there are many different approaches by now [ ], we give a few illustrative examples.

**Example: Additive Noise Models**

We have mentioned that any observational distribution can be modeled by several SEMs with different graphs. For certain restrictions of the SEM, we obtain full identifiability. That is, given an observational distribution $P$, we can recover the underlying causal graph. In additive noise models the structural equations are of the form $Z=f(X,Y)+N$. The subclass of linear functions and additive Gaussian noise does not lead to identifiability. This, however, constitutes an exceptional setting. If one assumes either (i) non-Gaussian noise, (ii) non-linear functions in the SEM [ ] or (iii) all noise variables to have the same variance [ ], one can show that additive noise models are identifiable. Methods that are based on additive noise models perform well not only on artificial data but also on the set of cause-effect pairs that we have collected over the last years [ ]. A similar result holds if all variables are integer-valued [ ] or if we interpret the additivity in $\mathbf{Z}/k\mathbf{Z}$ [ ]. The concept of additive noise has been extended to time series, too [ ].
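A common way to exploit this identifiability for two variables is to regress each variable on the other and check in which direction the residuals look independent of the input. The following Python sketch illustrates the idea under stated assumptions: polynomial regression stands in for a general non-linear regression, a biased HSIC estimate with Gaussian kernels serves as the dependence measure, and the data come from a made-up quadratic mechanism. The function names are hypothetical and not taken from any published implementation.

```python
import numpy as np

def gaussian_gram(v):
    """Gaussian-kernel Gram matrix, bandwidth set by the median heuristic."""
    d2 = (v[:, None] - v[None, :]) ** 2
    bandwidth2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / (2.0 * bandwidth2))

def hsic(a, b):
    """Biased empirical HSIC; close to zero if a and b are independent."""
    n = len(a)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(gaussian_gram(a) @ h @ gaussian_gram(b) @ h) / n**2

def anm_score(cause, effect, degree=5):
    """Dependence between a candidate cause and the regression residual."""
    coeffs = np.polyfit(cause, effect, degree)  # assumption: polynomial fit
    residual = effect - np.polyval(coeffs, cause)
    return hsic(cause, residual)

def standardize(v):
    return (v - v.mean()) / v.std()

# Synthetic additive noise model X -> Y (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=500)
y = x**2 + 0.5 * rng.normal(size=500)
x, y = standardize(x), standardize(y)

score_xy = anm_score(x, y)  # residuals of Y given X
score_yx = anm_score(y, x)  # residuals of X given Y
inferred = "X -> Y" if score_xy < score_yx else "Y -> X"
print(inferred, score_xy, score_yx)
```

The sketch only compares two dependence scores. A practical procedure would typically turn them into p-values of an independence test and abstain if both directions are rejected or both are accepted.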

**Example: Information Geometric Causal Inference**

Although the above methods inherently rely on noisy causal relations, statistical asymmetries between cause and effect can appear even for deterministic relations. We have considered the case where $Y=f(X)$ and $X=f^{-1}(Y)$, for some invertible function $f$, where the task is to tell which variable is the cause. Applying the general principle [ ] that $P(X)$ and $P(Y|X)$ are algorithmically independent if $X$ causes $Y$, we postulate that the shortest description of $P(X,Y)$ is given by separate descriptions of $P(X)$ and $f$. Description length in the sense of Kolmogorov complexity is uncomputable, but we can easily test the following kind of dependence: choosing $P(X)$ and $f$ independently typically implies that $P(Y)$ tends to have high probability density in regions where $f^{-1}$ has a large Jacobian. This observation can be made precise within an information-theoretic framework [ ] showing that applying a non-linear $f$ to $P(X)$ decreases entropy and increases the relative entropy distance to Gaussians, provided that a certain independence between $f$ and $P(X)$ is postulated, which can be phrased as orthogonality in information space. The corresponding inference method is computationally simple and achieved positive results on real data.
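One simple estimator in this spirit is slope-based: after rescaling both variables to $[0,1]$, it averages the log-slopes of the empirical function and prefers the direction with the smaller value. The Python sketch below is a minimal version under assumptions made here for illustration: a uniform reference measure, a noise-free monotone relation $y = x^3$, and a hypothetical function name `igci_slope`.

```python
import numpy as np

def igci_slope(x, y):
    """Slope-based estimate of E[log |f'(X)|] after rescaling to [0, 1].

    Assumes a uniform reference measure (min-max rescaling).
    """
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    dx, dy = np.diff(x[order]), np.diff(y[order])
    keep = (dx != 0) & (dy != 0)  # skip ties to avoid log(0)
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

# Deterministic toy relation Y = f(X) with f(x) = x^3 (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = x**3

c_xy = igci_slope(x, y)  # score for the hypothesis X -> Y
c_yx = igci_slope(y, x)  # score for the hypothesis Y -> X
inferred = "X -> Y" if c_xy < c_yx else "Y -> X"
print(inferred, c_xy, c_yx)
```

For a strictly monotone noise-free relation the two scores are negatives of each other, so the decision reduces to the sign of `c_xy`.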

While the aforementioned method requires non-linearity, there is a different approach, called the "trace method", for linear invertible relations between multi-dimensional variables that is related in spirit: if the covariance matrix of $X$ and the structure matrix $A$ relating $X$ and $Y$ (i.e., $Y=AX$) are chosen independently, directions with high covariance of $Y$ tend to coincide with directions corresponding to small eigenvalues of $A^{-1}$, which can be checked by a simple formula relating traces of covariance matrices to traces of structure matrices. This way, cause and effect can be distinguished even in the Gaussian case [ ] and even in the regime where the dimension exceeds the number of data points [ ]. The method turned out to be helpful even for noisy data.
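One way to write such a trace formula, assuming the noise-free model $Y = AX$ with an invertible $n \times n$ matrix $A$ (the symbols $\tau$, $\Sigma_X$, $\Sigma_Y$ and $\Delta$ are notation introduced here for illustration), is

$$ \tau\left(A \Sigma_X A^T\right) \approx \tau\left(\Sigma_X\right)\,\tau\left(A A^T\right), \qquad \tau(M) := \tfrac{1}{n}\operatorname{tr}(M), $$

where $\Sigma_X$ denotes the covariance matrix of $X$. A violation of this condition can be quantified by

$$ \Delta_{X \to Y} := \log \tau\left(A \Sigma_X A^T\right) - \log \tau\left(\Sigma_X\right) - \log \tau\left(A A^T\right), $$

and analogously $\Delta_{Y \to X}$ with $A^{-1}$ in place of $A$ and $\Sigma_Y = A \Sigma_X A^T$ in place of $\Sigma_X$. Under these assumptions, a simple decision rule would prefer the direction whose $|\Delta|$ is closer to zero.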

**Example: Invariant Prediction**

In many situations, we are interested in the system's behavior under a change of environment. Here, causal models become important because they are usually considered invariant under those changes. A causal prediction (which uses only direct causes of the target variable as predictors) remains valid even if we intervene on predictor variables or change the whole experimental setting. In this approach, we exploit invariant prediction for causal inference: given data from different experimental settings, we use invariant models to estimate the set of causal predictors for a given target variable. This method even leads to valid confidence intervals for causal relations [ ].

In the figure, the regression model of gene 4710 on gene 5954 is invariant across both environments. Indeed, we find 5954 $\rightarrow$ 4710 (possibly indirectly), since an intervention on gene 5954 changes the value of gene 4710.
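The invariance idea can be sketched in a few lines of Python: for every subset of predictors, fit a pooled regression and test whether the residuals look the same in all environments; the estimate is the intersection of all subsets that pass. Everything specific below is an assumption made for illustration: the two-environment toy data, the simple z-tests on residual means and variances, the level `alpha`, and the function names.

```python
import itertools
import math
import numpy as np

def invariance_pvalue(x, y, env):
    """p-value for 'pooled OLS residuals look the same in both environments'."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    res = y - design @ beta
    r0, r1 = res[env == 0], res[env == 1]
    v0, v1 = r0.var(ddof=1), r1.var(ddof=1)
    # Assumption: Gaussian approximations for mean and log-variance differences.
    z_mean = (r0.mean() - r1.mean()) / math.sqrt(v0 / len(r0) + v1 / len(r1))
    z_var = math.log(v0 / v1) / math.sqrt(2 / (len(r0) - 1) + 2 / (len(r1) - 1))
    p_values = [math.erfc(abs(z) / math.sqrt(2)) for z in (z_mean, z_var)]
    return min(1.0, 2 * min(p_values))  # Bonferroni over the two tests

def invariant_causal_predictors(x, y, env, alpha=0.01):
    """Intersect all predictor subsets whose invariance is not rejected."""
    d = x.shape[1]
    accepted = [
        subset
        for k in range(d + 1)
        for subset in itertools.combinations(range(d), k)
        if invariance_pvalue(x[:, list(subset)], y, env) > alpha
    ]
    if not accepted:
        return set(), accepted
    return set.intersection(*(set(s) for s in accepted)), accepted

# Toy data with graph X1 -> Y -> X2 and two environments (illustrative).
rng = np.random.default_rng(0)
n = 1000
env = np.repeat([0, 1], n)
x1 = rng.normal(size=2 * n) + 2.0 * env       # intervention shifts X1
y = x1 + rng.normal(size=2 * n)               # mechanism of Y is unchanged
x2 = y + 2.0 * env + rng.normal(size=2 * n)   # child of Y, also shifted
x = np.column_stack([x1, x2])

estimate, accepted = invariant_causal_predictors(x, y, env)
print("accepted subsets:", accepted, "estimated causal predictors:", estimate)
```

Taking the intersection is what makes the estimate conservative: a variable is reported only if every invariant subset contains it. If no subset is accepted, the sketch returns the empty set.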

**Example: Hidden Confounding in Time Series**

Assume we are given a multivariate time series $X_1,\ldots, X_L$ of measurements. In this project, our goal is to infer the causal structure underlying $X_1,\ldots, X_L$, in spite of a potential unmeasured confounder $(Z_t)_{t \in {\mathbb Z}}$. Previous approaches, such as Granger causality, often implicitly assume that there is no such hidden confounder. In contrast, we assume a vector autoregressive (VAR) causal model $$ \left( \begin{array}{c} X_t \\ Z_t \end{array} \right) := \left(\begin{array}{cc} {\color{red}B} & {\color{blue}C} \\ D & E \end{array} \right) \left( \begin{array}{c} X_{t-1} \\ Z_{t-1} \end{array} \right) + N_t, $$ see the figure on the side for the causal DAG. We prove that additionally restricting the model class to non-Gaussian, independent noise $(N_t)_{t \in {\mathbb Z}}$ makes ${\color{red}B}$ and ${\color{blue}C}$ essentially identifiable [ ]. We show that $D=0$ is another sufficient restriction of the model class for almost identifiability of ${\color{red}B}$. As a practical method we present two estimation algorithms that are tailored towards the model assumptions just mentioned. Additionally we show how these assumptions can, to some extent, be checked from only the given $X_1,\ldots, X_L$.
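To see why a hidden confounder matters, the following Python sketch simulates a small VAR process of the above form and compares two least-squares estimates of $B$: one that ignores $Z$, as a Granger-style analysis would, and an oracle that has access to $Z$. It does not implement the estimation algorithms mentioned above. The matrix entries, the dimensions, and the Laplace noise are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative structure matrices (assumptions, not taken from the paper).
B = np.array([[0.5, 0.0], [0.3, 0.5]])  # X_{t-1} -> X_t
C = np.array([[1.0], [1.0]])            # Z_{t-1} -> X_t (hidden confounder)
D = np.zeros((1, 2))                    # X_{t-1} -> Z_t
E = np.array([[0.9]])                   # Z_{t-1} -> Z_t
A = np.block([[B, C], [D, E]])

T = 5000
W = np.zeros((T, 3))                    # row t holds (X_t, Z_t)
noise = rng.laplace(size=(T, 3))        # non-Gaussian, independent noise
for t in range(1, T):
    W[t] = A @ W[t - 1] + noise[t]

def lagged_least_squares(past, present):
    """Least-squares fit of present_t = M @ past_t; returns M."""
    coef, *_ = np.linalg.lstsq(past, present, rcond=None)
    return coef.T

X = W[:, :2]
B_naive = lagged_least_squares(X[:-1], X[1:])          # ignores Z
B_oracle = lagged_least_squares(W[:-1], X[1:])[:, :2]  # uses the unobserved Z

err_naive = np.abs(B_naive - B).max()
err_oracle = np.abs(B_oracle - B).max()
print(f"max error ignoring Z: {err_naive:.3f}, with Z: {err_oracle:.3f}")
```

The oracle is of course unavailable in practice; it merely shows that the bias of the naive estimate stems from the omitted variable and not from a lack of data.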

We believe that causal knowledge is not only useful for predicting the effect of interventions but that in some scenarios causal ideas can also improve the performance of classical machine learning methods. Again, we concentrate on two examples and refer to other papers for more [ ].

**Example: Half-sibling regression**

Half-sibling regression is a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. This method is inspired by the concept of additive noise models. As an example, consider the search for exoplanets. The Kepler space observatory, launched in 2009, observes a tiny fraction of the Milky Way in search of exoplanets. Even though the telescope was pointed at a small patch, it monitored the brightness of ca. 150,000 stars. Stars that are orbited by a planet of sufficient size produce light curves that show a periodic decrease of light intensity. All of these measurements, however, are corrupted by systematic noise that is due to the telescope and that makes the signal from possible planets hard to detect. But because the stars can be assumed to be causally independent of each other (they are light-years apart), we can denoise the signal of a single star by removing all information that can be explained by the measurements of all other stars. Under the assumption that the systematic noise acts in an additive manner, there are several theoretical guarantees [ ].
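The denoising step can be sketched as a regression of the target star on the other stars, keeping the residual as the estimate of the latent signal. The Python example below is a toy version under assumptions made here: synthetic light curves, a shared random-walk drift as the systematic noise, and ordinary linear least squares as the regression method.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_other = 2000, 20
t = np.arange(T)

# Hidden systematic drift shared by all stars (assumption: a random walk).
systematic = 0.1 * np.cumsum(rng.normal(size=T))

# Latent quantity of interest: periodic transit-like dips of the target star.
signal = -1.0 * (t % 200 < 10)
target = signal + 2.0 * systematic + 0.1 * rng.normal(size=T)

# Other stars carry no information about the target's signal,
# but they are affected by the same systematics.
gains = rng.uniform(0.5, 2.0, size=n_other)
others = systematic[:, None] * gains + 0.1 * rng.normal(size=(T, n_other))

# Regress the target on its "half-siblings" and keep the residual.
design = np.column_stack([np.ones(T), others])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
denoised = target - design @ coef

corr_before = np.corrcoef(target, signal)[0, 1]
corr_after = np.corrcoef(denoised, signal)[0, 1]
print(f"correlation with true signal: raw {corr_before:.2f}, denoised {corr_after:.2f}")
```

Because the regression includes an intercept, the residual recovers the signal only up to an additive constant; a more flexible regression method could replace the linear fit if the systematics enter non-linearly.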

**Example: Semi-supervised Learning**

Our work [ ] discusses several implications of the above-mentioned asymmetries between cause and effect for standard machine learning. Let us assume that $Y$ is predicted from $X$. The figure below shows the "causal" and "anticausal" prediction scenarios: predicting the effect from the cause or vice versa, respectively. The function $\varphi$ describes the causal mechanism. We have hypothesized that semi-supervised learning (SSL) does not help if $X$ is the cause of $Y$, as in the figure on the left, whereas it often helps if $Y$ is the cause (right). This is because additional $x$-values only tell us more about $P(X)$, which is irrelevant in the case of causal prediction because the prediction requires information about the "unrelated" object $P(Y|X)$. Our meta-study analyzing results reported in the SSL literature supports the hypothesis: all cases where SSL helped were anticausal, confounded, or examples where the causal structure was unclear; see the figure below. We have developed a causal discovery method [ ] which exploits the fact that SSL can only work in the anticausal direction.