Sketch of the transit method for exoplanet detection. As a planet passes in front of its host star, we can observe a small dip in the star's apparent brightness (image credit: NASA Ames). A simple causal model of the data-generating process helped to better correct for measurement errors of the variable `brightness'.

Jonas Peters (Project Leader),
Bernhard Schölkopf (Director),
Dominik Janzing,
Biwei Huang,
Carl Johann Simon-Gabriel,
Eleni Sgouritsa,
Philipp Geiger,
Krikamol Muandet

The detection of statistical dependences is a core problem of statistics and machine learning. It plays a crucial role in association studies, and it generalizes the prediction tasks commonly studied in machine learning. In recent years, machine learning methods have enabled us to make rather accurate predictions, often based on large training sets, for complex nonlinear problems that not long ago would have appeared completely random. In many situations, however, we would prefer a causal model to a purely predictive one, i.e., a model that might tell us that a specific variable (say, whether or not a person smokes) is not just statistically associated with a disease but is causal for it. Often we are also interested in quantifying the strength of a causal influence, which is challenging also from a conceptual perspective [ ].

In order to formulate and tackle such causal problems, we use the language of structural equation models (SEMs). In an SEM, each variable is modeled as a deterministic function of its direct causes (sometimes called ``parents'') and a noise variable $N$, e.g., $Y=f(X,Z,N)$; all noise variables are assumed to be jointly independent. SEMs not only allow us to model observational distributions; they also model what happens under interventions, i.e., when some of the variables are actively set to specific values (e.g., gene knockouts or randomized studies).
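As a minimal sketch (with a hypothetical linear SEM and hand-picked coefficients, not taken from our papers), the following code samples both from the observational distribution and from the interventional distribution under $do(X=1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_x=None):
    # Hypothetical SEM with graph X -> Y <- Z; Y = f(X, Z, N)
    x = rng.normal(size=n) if do_x is None else np.full(n, do_x)
    z = rng.normal(size=n)
    noise = rng.normal(size=n)          # jointly independent noise variables
    y = x + 2.0 * z + noise             # Y = f(X, Z, N) with linear f
    return x, y

x_obs, y_obs = sample()                 # observational distribution
_, y_do = sample(do_x=1.0)              # interventional distribution do(X = 1)
# Under do(X = 1), the mean of Y shifts to approximately 1
```

Note that the intervention replaces the structural equation for $X$ while leaving the other equations untouched; this is exactly what the SEM formalism makes explicit.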

In recent years, we have developed conditions under which the structure of an SEM (i.e., the causal graph) is identifiable from the available data. These assumptions include restricted structural equation models, independence of cause and mechanism, and invariance principles. All of these conditions can identify the correct graph even within the Markov equivalence class. We also design corresponding estimation procedures and analyze their statistical properties. Recently, we have also started to use causal ideas to improve machine learning methods.

\paragraph{Causal Discovery}

In causal discovery (or structure learning), we try to learn the causal structure from available data (observational and/or interventional). Using the concept of SEMs, it becomes apparent that without further assumptions this goal is impossible to achieve: if we are given only an observational distribution $P$, for example, then for any graph $G$ such that $P$ is Markov w.r.t.\ $G$, we can find an SEM with graph $G$ that models $P$. That is, the true causal structure is not identifiable from $P$ alone. In our group we investigate different assumptions that make the graph identifiable from the observational distribution. Additionally, we develop corresponding algorithms that exploit this identifiability and estimate the graph structure from finitely many data points. While there are many different approaches by now [ ], we give a few illustrative examples.
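This non-identifiability can be verified numerically. In the toy sketch below (a linear Gaussian example with hand-picked coefficients), an SEM with graph $X \rightarrow Y$ and an SEM with graph $Y \rightarrow X$ induce the same joint distribution, so the graph cannot be read off the observational distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# SEM 1, graph X -> Y:  X = N_X,  Y = 0.5 X + N_Y
x1 = rng.normal(size=n)
y1 = 0.5 * x1 + np.sqrt(0.75) * rng.normal(size=n)

# SEM 2, graph Y -> X:  Y = N_Y,  X = 0.5 Y + N_X
y2 = rng.normal(size=n)
x2 = 0.5 * y2 + np.sqrt(0.75) * rng.normal(size=n)

# Both SEMs induce the same joint Gaussian
# (unit variances, correlation 0.5)
```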

{\it Example (1): Additive Noise Models.} We have mentioned that any observational distribution can be modeled by several SEMs with different graphs. For suitably restricted classes of SEMs, however, we obtain full identifiability: given the observational distribution $P$, we can recover the underlying causal graph. In additive noise models the structural equations are of the form $Z=f(X,Y)+N$. The subclass of linear functions with additive Gaussian noise does not lead to identifiability; this, however, constitutes an exceptional setting. If one assumes either (i) non-Gaussian noise, (ii) non-linear functions in the SEM [ ], or (iii) all noise variables to have the same variance, one can show that additive noise models are identifiable. Methods based on additive noise models perform well not only on artificial data but also on the set of cause-effect pairs that we have collected over the last years. A similar result holds if all variables are integer-valued [ ] or if we interpret the additivity in $\mathbf{Z}/k\mathbf{Z}$ [ ]. The concept of additive noise has also been extended to time series [ ].

{\it Example (2): Information-Geometric Causal Inference (IGCI).} Although the above methods inherently rely on noisy causal relations, statistical asymmetries between cause and effect can appear even for deterministic relations. We have considered the case where $Y=f(X)$ and $X=f^{-1}(Y)$ for some invertible function $f$, and the task is to tell which variable is the cause. Applying the general principle [ ] that $P(X)$ and $P(Y|X)$ are algorithmically independent if $X$ causes $Y$, we postulate that the shortest description of $P(X,Y)$ is given by separate descriptions of $P(X)$ and $f$. Description length in the sense of Kolmogorov complexity is uncomputable, but we can easily test the following kind of dependence: if $P(X)$ and $f$ are chosen independently, then $P(Y)$ tends to have high probability density in regions where $f^{-1}$ has a large Jacobian. This observation can be made precise within an information-theoretic framework [ ] showing that applying a non-linear $f$ to $P(X)$ decreases entropy and increases the relative entropy distance to Gaussians, provided a certain independence between $f$ and $P(X)$ is postulated, which can be phrased as orthogonality in information space. The corresponding inference method is computationally simple and performs well for simulated deterministic relations. Its performance on real data (which are usually noisy) depends heavily on various conditions. However, IGCI is an insightful toy example for explaining why independence of $P(\mathrm{effect}\,|\,\mathrm{cause})$ and $P(\mathrm{cause})$ typically entails dependence between $P(\mathrm{cause}\,|\,\mathrm{effect})$ and $P(\mathrm{effect})$. This insight suggests, for instance, new approaches to semi-supervised learning [ ], see also below.
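A sketch of a slope-based IGCI estimator (uniform reference measure; the cubic mechanism below is a hypothetical example): each direction gets an empirical estimate of $\int \log |f'|\, dP$, and the direction with the smaller estimate is inferred to be causal.

```python
import numpy as np

def igci_score(x, y):
    # Slope-based estimate of \int log|f'(x)| dP(x), with both variables
    # rescaled to [0, 1] (uniform reference measure)
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    idx = np.argsort(x)
    dx = np.diff(x[idx])
    dy = np.diff(y[idx])
    keep = (dx != 0) & (dy != 0)        # guard against ties
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 2000)
y = x ** 3                              # deterministic, invertible mechanism

# Lower score indicates the causal direction: here X -> Y
infer_x_to_y = igci_score(x, y) < igci_score(y, x)
```

For this example the two scores are approximately $\log 3 - 2 \approx -0.9$ and $2 - \log 3 \approx 0.9$, reflecting the entropy decrease in the causal direction.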

While the aforementioned method requires non-linearity, there is a related approach, the ``trace method'', for linear invertible relations $Y=AX$ between multi-dimensional variables: if the covariance matrix of $X$ and the structure matrix $A$ are chosen independently, directions with high covariance of $Y$ tend to coincide with directions corresponding to small eigenvalues of $A^{-1}$. This can be checked by a simple formula relating traces of covariance matrices to traces of structure matrices. This way, cause and effect can be distinguished even in the Gaussian case [ ], and even in the regime where the dimension exceeds the number of data points [ ]. The method turned out to be helpful also for noisy data.
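A sketch of the underlying trace condition (with randomly drawn matrices as stand-ins for ``independently chosen''): the normalized trace identity holds approximately in the causal direction and is systematically violated in the backward direction, where $A^{-1}$ and the covariance of $Y$ are no longer independent.

```python
import numpy as np

def trace_delta(A, Sigma):
    # Log-deviation from the trace condition
    #   tr(A Sigma A^T) / d  ~  tr(A A^T) tr(Sigma) / d^2
    dim = A.shape[0]
    return np.log(dim * np.trace(A @ Sigma @ A.T)
                  / (np.trace(A @ A.T) * np.trace(Sigma)))

rng = np.random.default_rng(0)
d = 100
A = rng.normal(size=(d, d)) / np.sqrt(d)    # structure matrix in Y = A X ...
B = rng.normal(size=(d, d)) / np.sqrt(d)
Sigma_X = B @ B.T                           # ... drawn independently of Cov(X)
Sigma_Y = A @ Sigma_X @ A.T                 # induced covariance of Y

delta_fwd = trace_delta(A, Sigma_X)                  # close to 0
delta_bwd = trace_delta(np.linalg.inv(A), Sigma_Y)   # systematically negative
```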

{\it Example (3): Invariant Prediction.} In many situations, we are interested in a system's behavior under a change of environment. Here, causal models become important because they are usually considered invariant under such changes. A causal prediction (which uses only the direct causes of the target variable as predictors) remains valid even if we intervene on the predictor variables or change the whole experimental setting. In this approach, we exploit this invariance for causal inference: given data from different experimental settings, we use invariant models to estimate the set of causal predictors of a given target variable. This method even leads to valid confidence intervals for causal relations [ ].
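A minimal sketch of the idea (with a hypothetical three-variable SEM, a mean-shift intervention defining the second environment, and a simple t-test on residual means as the invariance check; the actual method uses more careful tests): regress the target on every candidate predictor set, keep the sets whose residuals look invariant across environments, and intersect them.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def make_env(n, shift):
    # Hypothetical SEM: X1 -> Y -> X2; the intervention shifts X1
    x1 = rng.normal(shift, 1.0, n)
    y = 2.0 * x1 + rng.normal(0.0, 1.0, n)
    x2 = y + rng.normal(0.0, 1.0, n)        # X2 predicts Y but is not causal
    return {"X1": x1, "X2": x2}, y

n = 1000
envs = [make_env(n, 0.0), make_env(n, 2.0)]
names = ["X1", "X2"]

accepted = []
for r in range(len(names) + 1):
    for subset in itertools.combinations(names, r):
        # Pooled regression of Y on the candidate predictor set
        X = np.column_stack([np.concatenate([e[0][v] for e in envs])
                             for v in subset] + [np.ones(2 * n)])
        y_all = np.concatenate([e[1] for e in envs])
        beta, *_ = np.linalg.lstsq(X, y_all, rcond=None)
        res = y_all - X @ beta
        # Simplified invariance check: equal residual means across environments
        if stats.ttest_ind(res[:n], res[n:]).pvalue > 1e-3:
            accepted.append(set(subset))

# Estimated causal predictors: variables contained in every accepted set
parents = set.intersection(*accepted) if accepted else set()
```

Here the sets $\{X_1\}$ and $\{X_1, X_2\}$ pass the invariance test while $\emptyset$ and $\{X_2\}$ fail, so the intersection correctly returns $X_1$ as the causal predictor.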


In the figure, the regression model of gene 4710 on gene 5954 is invariant in both environments. Indeed, we find 5954 $\rightarrow$ 4710 (possibly indirectly), since an intervention on gene 5954 changes the value of gene 4710.

{\it Example (4): Hidden Confounding in Time Series.} Assume we are given a multivariate time series $X_1,\ldots, X_L$ of measurements. In this project, our goal is to infer the causal structure underlying $X_1,\ldots, X_L$, in spite of a potential unmeasured confounder $(Z_t)_{t \in {\mathbb Z}}$. Previous approaches, such as Granger causality, often implicitly assume that there is no such hidden confounder. In contrast, we assume a vector autoregressive (VAR) causal model $$ \left( \begin{array}{c} X_t \\ Z_t \end{array} \right) := \left(\begin{array}{cc} {\color{red}B} & {\color{blue}C} \\ D & E \end{array} \right) \left( \begin{array}{c} X_{t-1} \\ Z_{t-1} \end{array} \right) + N_t, $$ see the figure on the side for the causal DAG. We prove that additionally restricting the model class to non-Gaussian, independent noise $(N_t)_{t \in {\mathbb Z}}$ makes ${\color{red}B}$ and ${\color{blue}C}$ essentially identifiable [ ]. We show that $D=0$ is another restriction of the model class that is sufficient for almost-identifiability of ${\color{red}B}$. As a practical method, we present two estimation algorithms that are tailored towards the model assumptions just mentioned. Additionally, we show how these assumptions can, to some extent, be checked from the given $X_1,\ldots, X_L$ alone.
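The following sketch (hand-picked stable coefficients, uniform and hence non-Gaussian noise, $D=0$) simulates the model and illustrates the problem: a naive VAR fit of the observed $X$ alone, which is what Granger-style analyses rely on, yields a biased estimate of ${\color{red}B}$ because of the hidden $Z$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stable coefficients; Z is a one-dimensional hidden component
B = np.array([[0.5, 0.1],
              [0.0, 0.4]])      # X_{t-1} -> X_t
C = np.array([[0.3],
              [0.2]])           # Z_{t-1} -> X_t (hidden confounding)
E = np.array([[0.6]])           # Z_{t-1} -> Z_t;  D = 0: X does not drive Z

T = 5000
x = np.zeros((T, 2))
z = np.zeros((T, 1))
for t in range(1, T):
    # non-Gaussian, independent noise, as required for identifiability
    x[t] = B @ x[t - 1] + C @ z[t - 1] + rng.uniform(-1, 1, 2)
    z[t] = E @ z[t - 1] + rng.uniform(-1, 1, 1)

# Naive VAR fit on the observed X alone
W, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
B_naive = W.T
# B_naive deviates from B: the hidden Z correlates X_{t-1} with X_t
```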

\paragraph{Causal Inference in Machine Learning}

We believe that causal knowledge is not only useful for predicting the effect of interventions but that in some scenarios causal ideas can also improve the performance of classical machine learning methods. Again, we concentrate only on two examples and refer to some other papers [ ].

{\it Example (5): Half-sibling regression.} Half-sibling regression is a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. The method is inspired by the concept of additive noise models. As an example, consider the search for exoplanets. The Kepler space observatory, launched in 2009, observes a tiny fraction of the Milky Way in search of exoplanets. Even though the telescope was pointed at a small patch of the sky, it monitored the brightness of ca.\ 150,000 stars. Stars that are orbited by a planet of sufficient size produce light curves that show a periodic decrease in light intensity, see the figure above. All of these measurements, however, are corrupted by systematic noise that is due to the telescope and that makes the signal of possible planets hard to detect. But because the stars can be assumed to be causally independent of each other (they are light years apart), we can denoise the signal of a single star by removing all information that can be explained by the measurements of all the other stars. Under the assumption that the systematic noise acts in an additive manner, this procedure comes with several theoretical guarantees [ ].
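A sketch with simulated light curves (hypothetical numbers: one shared instrument drift, 50 ``half-sibling'' stars, and a boxcar transit signal). The target star's measurement is regressed on all other stars; since those stars share only the systematic noise with the target, what remains is an estimate of the intrinsic signal up to an offset.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_stars = 2000, 50

# Shared systematic (instrument drift) corrupting all light curves additively
q = np.cumsum(rng.normal(0.0, 0.1, T))
coeffs = rng.uniform(0.5, 1.5, n_stars)
others = rng.normal(0.0, 0.2, (T, n_stars)) + np.outer(q, coeffs)

signal = np.where(np.arange(T) % 200 < 10, -1.0, 0.0)  # periodic transit dips
target = signal + 0.8 * q + rng.normal(0.0, 0.1, T)

# Half-sibling regression: subtract everything in `target` that the
# causally independent stars can explain (here: the shared drift q)
X = np.column_stack([others, np.ones(T)])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
denoised = target - X @ beta
# `denoised` recovers the transit signal up to its mean
```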

{\it Example (6): Semi-supervised Learning.} Our recent work [ ] discusses several implications of the above-mentioned asymmetries between cause and effect for standard machine learning. Let us assume that $Y$ is predicted from $X$. The figure below shows the ``causal'' and ``anticausal'' prediction scenarios: predicting the effect from the cause or vice versa, respectively. The function $\varphi$ describes the causal mechanism. We have hypothesized that semi-supervised learning (SSL) does not help if $X$ is the cause of $Y$, whereas it often helps if $Y$ is the cause. This is because additional $x$-values only tell us more about $P(X)$ -- which is irrelevant in the case of causal prediction, since the prediction requires information about the ``unrelated'' object $P(Y|X)$. Our meta-study analyzing results reported in the SSL literature supports this hypothesis: all cases where SSL helped were anticausal, confounded, or examples where the causal structure was unclear; see the figure below. We have developed a causal discovery method [ ] which exploits the fact that SSL can only work in the anticausal direction. To elaborate on the link between causal direction and the performance of SSL, [ ] studies the toy problem of interpolating a monotonically increasing function for the case where the relation between $X$ and $Y$ is deterministic. In such a scenario, $P(X)$ can be shown to be beneficial for predicting $Y$ from $X$ whenever $P(X|Y)$ and $P(Y)$ satisfy a certain independence condition, which coincides with the one postulated in Information-Geometric Causal Inference [ ].
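As a toy illustration of why unlabeled data can help in the anticausal direction (a hypothetical class-conditional Gaussian model, not taken from our papers): here $P(X)$ is a mixture whose components align with $P(X|Y)$, so unlabeled $x$-values alone suffice to locate the class-conditional means via EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Anticausal scenario: the label Y causes the feature X
y = rng.integers(0, 2, 5000)
x = rng.normal(2.0 * (2 * y - 1), 1.0)   # X|Y=0 ~ N(-2,1), X|Y=1 ~ N(+2,1)

# Treat all x-values as unlabeled and fit a two-component Gaussian
# mixture (equal weights and unit variances assumed) by EM
mu = np.array([x.min(), x.max()])        # crude initialization
for _ in range(100):
    resp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)   # E-step: responsibilities
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)  # M-step
# mu approximates the class-conditional means although no labels were used
```

In the causal direction, by contrast, $P(X)$ carries no such information about $P(Y|X)$, so the analogous use of unlabeled data cannot help.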


**Semi-Supervised Interpolation in an Anticausal Learning Scenario**
*Journal of Machine Learning Research*, 16, pages: 1923-1948, September 2015 (article)

**Removing systematic errors for exoplanet search via latent causes**
In *Proceedings of the 32nd International Conference on Machine Learning*, 37, pages: 2218–2226, JMLR Workshop and Conference Proceedings, (Editors: F. Bach and D. Blei), JMLR, ICML, 2015 (inproceedings)

**Causal Inference by Identification of Vector Autoregressive Processes with Hidden Components**
In *Proceedings of the 32nd International Conference on Machine Learning*, 37, pages: 1917–1925, JMLR Workshop and Conference Proceedings, (Editors: F. Bach and D. Blei), JMLR, ICML, 2015 (inproceedings)

**Telling cause from effect in deterministic linear dynamical systems**
In *Proceedings of the 32nd International Conference on Machine Learning*, 37, pages: 285–294, JMLR Workshop and Conference Proceedings, (Editors: F. Bach and D. Blei), JMLR, ICML, 2015 (inproceedings)

**Inference of Cause and Effect with Unsupervised Inverse Regression**
In *Proceedings of the 18th International Conference on Artificial Intelligence and Statistics*, 38, pages: 847-855, JMLR Workshop and Conference Proceedings, (Editors: Lebanon, G. and Vishwanathan, S.V.N.), JMLR.org, AISTATS, 2015 (inproceedings)

**Estimating Causal Effects by Bounding Confounding**
In *Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence*, pages: 240-249, (Editors: Nevin L. Zhang and Jin Tian), AUAI Press, Corvallis, Oregon, UAI, 2014 (inproceedings)

**Causal Discovery with Continuous Additive Noise Models**
*Journal of Machine Learning Research*, 15, pages: 2009-2053, 2014 (article)

**From Ordinary Differential Equations to Structural Causal Models: the deterministic case**
In *Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence*, pages: 440-448, (Editors: A Nicholson and P Smyth), AUAI Press, Corvallis, Oregon, UAI, 2013 (inproceedings)

**Quantifying causal influences**
*Annals of Statistics*, 41(5):2324-2358, 2013 (article)

**Identifying Finite Mixtures of Nonparametric Product Distributions and Causal Inference of Confounders**
In *Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages: 556-565, (Editors: A Nicholson and P Smyth), AUAI Press Corvallis, Oregon, USA, UAI, 2013 (inproceedings)

**Causal Inference on Time Series using Restricted Structural Equation Models**
In *Advances in Neural Information Processing Systems 26*, pages: 154-162, (Editors: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger), 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013 (inproceedings)

**Information-geometric approach to inferring causal directions**
*Artificial Intelligence*, 182-183, pages: 1-31, May 2012 (article)

**On Causal and Anticausal Learning**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 1255-1262, (Editors: J Langford and J Pineau), Omnipress, New York, NY, USA, ICML, 2012 (inproceedings)

**Causal Inference on Discrete Data using Additive Noise Models**
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 33(12):2436-2450, December 2011 (article)

**Detecting low-complexity unobserved causes**
In pages: 383-391, (Editors: FG Cozman and A Pfeffer), AUAI Press, Corvallis, OR, USA, 27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011 (inproceedings)

**Identifiability of causal graphs using functional models**
In pages: 589-598, (Editors: FG Cozman and A Pfeffer), AUAI Press, Corvallis, OR, USA, 27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011 (inproceedings)

**Kernel-based Conditional Independence Test and Application in Causal Discovery**
In pages: 804-813, (Editors: FG Cozman and A Pfeffer), AUAI Press, Corvallis, OR, USA, 27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011 (inproceedings)

**Testing whether linear equations are causal: A free probability theory approach**
In pages: 839-847, (Editors: FG Cozman and A Pfeffer), AUAI Press, Corvallis, OR, USA, 27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011 (inproceedings)

**k-NN Regression Adapts to Local Intrinsic Dimension**
In *Advances in Neural Information Processing Systems 24*, pages: 729-737, (Editors: J Shawe-Taylor and RS Zemel and P Bartlett and F Pereira and KQ Weinberger), Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS), 2011 (inproceedings)

**Causal Inference Using the Algorithmic Markov Condition**
*IEEE Transactions on Information Theory*, 56(10):5168-5194, October 2010 (article)

**Justifying Additive Noise Model-Based Causal Discovery via Algorithmic Information Theory**
*Open Systems and Information Dynamics*, 17(2):189-212, June 2010 (article)

**Inferring deterministic causal relations**
In *Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence*, pages: 143-150, (Editors: P Grünwald and P Spirtes), AUAI Press, Corvallis, OR, USA, UAI, July 2010 (inproceedings)

**Telling cause from effect based on high-dimensional observations**
In *Proceedings of the 27th International Conference on Machine Learning*, pages: 479-486, (Editors: J Fürnkranz and T Joachims), International Machine Learning Society, Madison, WI, USA, ICML, June 2010 (inproceedings)

**Probabilistic latent variable models for distinguishing between cause and effect**
In *Advances in Neural Information Processing Systems 23*, pages: 1687-1695, (Editors: J Lafferty and CKI Williams and J Shawe-Taylor and RS Zemel and A Culotta), Curran, Red Hook, NY, USA, 24th Annual Conference on Neural Information Processing Systems (NIPS), 2010 (inproceedings)

**Identifying Cause and Effect on Discrete Data using Additive Noise Models**
In *JMLR Workshop and Conference Proceedings Volume 9: AISTATS 2010*, pages: 597-604, (Editors: YW Teh and M Titterington), JMLR, Cambridge, MA, USA, 13th International Conference on Artificial Intelligence and Statistics, May 2010 (inproceedings)

**Causal Markov condition for submodular information measures**
In *Proceedings of the 23rd Annual Conference on Learning Theory*, pages: 464-476, (Editors: AT Kalai and M Mohri), OmniPress, Madison, WI, USA, COLT, June 2010 (inproceedings)

**Robot Learning**
*IEEE Robotics and Automation Magazine*, 16(3):19-20, September 2009 (article)

**Nonlinear causal discovery with additive noise models**
In *Advances in neural information processing systems 21*, pages: 689-696, (Editors: D Koller and D Schuurmans and Y Bengio and L Bottou), Curran, Red Hook, NY, USA, 22nd Annual Conference on Neural Information Processing Systems (NIPS), June 2009 (inproceedings)

**Identifying confounders using additive noise models**
In *Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence*, pages: 249-257, (Editors: J Bilmes and AY Ng), AUAI Press, Corvallis, OR, USA, UAI, June 2009 (inproceedings)

**Regression by dependence minimization and its application to causal inference in additive noise models**
In *Proceedings of the 26th International Conference on Machine Learning*, pages: 745-752, (Editors: A Danyluk and L Bottou and M Littman), ACM Press, New York, NY, USA, ICML, June 2009 (inproceedings)

**Nonlinear directed acyclic structure learning with weakly additive noise models**
In *Advances in Neural Information Processing Systems 22*, pages: 1847-1855, (Editors: Y Bengio and D Schuurmans and J Lafferty and C Williams and A Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)