"Why Correlation Usually != Causation" by Gwern Branwen
"Do we still need models or just more data and compute?" by Max Welling
"ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus" by Ferenc Huszar
"Causal Inference 2: Illustrating Interventions via a Toy Example" by Ferenc Huszar
"Causal Inference 3: Counterfactuals" by Ferenc Huszar
"Causal Data Science" by Adam Kelleher:
- "If Correlation Doesn’t Imply Causation, Then What Does?"
- "Understanding Bias: A Prerequisite For Trustworthy Results"
- "Speed vs. Accuracy: When Is Correlation Enough? When Do You Need Causation?"
- "A Technical Primer on Causality"
- "The Data Processing Inequality"
- "Causal Graph Inference"
"If Correlation Doesn’t Imply Causation, then What Does?" by Michael Nielsen
"Latent Variables and Model Mis-specification" by Jacob Steinhardt
"Causality in Machine Learning" by Muralidharan et al.
"The Seven Tools of Causal Inference with Reflections on Machine Learning" by Judea Pearl paper
(talk video)
"Theoretical Impediments to Machine Learning" by Judea Pearl paper
"Causality for Machine Learning" by Bernhard Scholkopf paper
"Towards Causal Representation Learning" by Scholkopf et al. paper
"On Pearl’s Hierarchy and the Foundations of Causal Inference" by Bareinboim, Correa, Ibeling, Icard paper
(talk video)
"Causality" by Ricardo Silva paper
"Introduction to Causal Inference" by Peter Spirtes paper
"Graphical Causal Models" by Cosma Shalizi paper
"The Book of Why: The New Science of Cause and Effect" by Judea Pearl and Dana Mackenzie book
(overview)
"Causal Inference in Statistics: A Primer" by Judea Pearl, Madelyn Glymour, Nicholas Jewell book
"Causality: Models, Reasoning, and Inference" by Judea Pearl book
(epilogue)
"Elements of Causal Inference" by Jonas Peters, Dominik Janzing, Bernhard Scholkopf book
"Causal Inference Book" by Miguel Hernan and James Robins book
tutorial by Bernhard Scholkopf video
tutorial by Jonas Peters video
tutorial by Jonas Peters video
course by Brady Neal video
"Causal Inference in Everyday Machine Learning" tutorial by Ferenc Huszar video
"Causal Inference in Online Systems: Methods, Pitfalls and Best Practices" tutorial by Amit Sharma video
(slides)
"Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement" tutorial by Thorsten Joachims and Adith Swaminathan video
"Counterfactual Reasoning and Massive Data Sets" by Leon Bottou video
"Counterfactual Inference" tutorial by Susan Athey video
"Causal Inference for Observational Studies" tutorial by David Sontag and Uri Shalit video
(slides)
"Connections between Causality and Machine Learning" by Jonas Peters video
"Science vs Data: Contesting the Soul of Data Science" by Judea Pearl video
"The Foundations of Causal Inference with Reflections on Machine Learning and Artificial Intelligence" by Judea Pearl video
"The New Science of Cause and Effect" by Judea Pearl video
"The Mathematics of Causal Inference with Reflections on Machine Learning" by Judea Pearl video
"The Mathematics of Causal Inference, with Reflections on Machine Learning and the Logic of Science" by Judea Pearl video
"On the Causal Foundations of AI (Explainability & Decision-Making)" by Elias Bareinboim video
"Causal Data Science: A General Framework for Data Fusion and Causal Inference" by Elias Bareinboim video
"Towards Causal Reinforcement Learning" ([1], [2]) by Elias Bareinboim video
"Causal Reinforcement Learning" by Elias Bareinboim video
"Learning Causal Mechanisms" by Bernhard Scholkopf video
"The Role of Causality for Interpretability" by Bernhard Scholkopf video
"Causal Learning" by Bernhard Scholkopf video
"Toward Causal Machine Learning" by Bernhard Scholkopf video
"Statistical and Causal Approaches to Machine Learning" by Bernhard Scholkopf video
"The Missing Signal" by Leon Bottou video
"Learning Representations Using Causal Invariance" by Leon Bottou video
workshop at NeurIPS 2018 (videos)
symposium at AAAI 2019
Causal inference is the problem of uncovering cause-effect relations between the variables of a data-generating system. Causal structure describes how the system will behave in changed and previously unseen environments. Knowledge of these causal dynamics makes it possible to answer "what if" questions, which describe potential responses of the system under hypothetical manipulations and interventions.
What if some railways are closed, what will passengers do? What if we incentivize members of a social network to propagate an idea, how influential can they be? What if some genes in a cell are knocked-out, which phenotypes can we expect? Such questions need to be addressed via a combination of experimental and observational data, and require a careful approach to modelling heterogeneous datasets and structural assumptions concerning the causal relations among components of the system.
A causal model is a set of assumptions about the data-generating process that cannot be expressed as properties of the joint distribution of the observed variables.
"In retrospect, my greatest challenge was to break away from probabilistic thinking and accept, first, that people are not probability thinkers but cause-effect thinkers and, second, that causal thinking cannot be captured in the language of probability; it requires a formal language of its own."
"What is more likely, that a daughter will have blue eyes given that her mother has blue eyes or the other way around — that the mother will have blue eyes given that the daughter has blue eyes? Most people will say the former — they'll prefer the causal direction. But it turns out the two probabilities are the same, because the number of blue-eyed people in every generation remains stable. I took it as evidence that people think causally, not probabilistically — they're biased by having easy access to causal explanations, even though probability theory tells you something different.
There are many biases in our judgment that are created by our inclination to attribute causal relationships where they do not belong. We see the world as a collection of causal relationships and not as a collection of statistical or associative relationships. Most of the time, we can get by, because they are closely tied together. Once in a while we fail. The blue-eye story is an example of such failure.
The slogan, "Correlation doesn't imply causation" leads to many paradoxes. For instance, the size of a child's thumb is highly correlated with their reading ability. So, naively, if you want to be taller, you should learn to read better. This kind of paradoxical example convinces us that correlation does not imply causation. Still, people fall into that trap quite often because they crave causal explanations. The mind is a causal processor, not an association processor. Once you acknowledge that, the question remains how we reconcile the discrepancies between the two. How do we organize causal relationships in our mind? How do we operate on and update such a mental presentation?"
"I now take causal relations as the fundamental building block that of physical reality and of human understanding of that reality, and I regard probabilistic relationships as but the surface phenomena of the causal machinery that underlies and propels our understanding of our world."
(Judea Pearl)
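The blue-eyes claim above can be checked with Bayes' rule. In the sketch below, M stands for "the mother has blue eyes" and D for "the daughter has blue eyes" (both symbols are introduced here for illustration), and the only assumption is the one stated in the quote: the share of blue-eyed people is the same in every generation, so P(D) = P(M).

```latex
P(D \mid M) = \frac{P(M \mid D)\, P(D)}{P(M)},
\qquad
P(D) = P(M) \;\Longrightarrow\; P(D \mid M) = P(M \mid D).
```

The two conditional probabilities coincide even though only one of them follows the causal direction, which is why a preference for the first is evidence of causal rather than probabilistic thinking.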
"If we examine the information that drives machine learning today, we find that it is almost entirely statistical. In other words, learning machines improve their performance by optimizing parameters over a stream of sensory inputs received from the environment. It is a slow process, analogous in many respects to the evolutionary survival-of-the-fittest process that explains how species like eagles and snakes have developed superb vision systems over millions of years. It cannot explain however the super-evolutionary process that enabled humans to build eyeglasses and telescopes over barely one thousand years. What humans possessed that other species lacked was a mental representation, a blue-print of their environment which they could manipulate at will to imagine alternative hypothetical environments for planning and learning. Anthropologists like N. Harari, and S. Mithen are in general agreement that the decisive ingredient that gave our homo sapiens ancestors the ability to achieve global dominion, about 40,000 years ago, was their ability to sketch and store a representation of their environment, interrogate that representation, distort it by mental acts of imagination and finally answer “What if?” kind of questions. Examples are interventional questions: “What if I act?” and retrospective or explanatory questions: “What if I had acted differently?” No learning machine in operation today can answer such questions about actions not taken before. Moreover, most learning machine today do not utilize a representation from which such questions can be answered. We postulate that the major impediment to achieving accelerated learning speeds as well as human level performance can be overcome by removing these barriers and equipping learning machines with causal reasoning tools. This postulate would have been speculative twenty years ago, prior to the mathematization of counterfactuals. Not so today. 
Advances in graphical and structural models have made counterfactuals computationally manageable and thus rendered metastatistical learning worthy of serious exploration."
"An extremely useful insight unveiled by the logic of causal reasoning is the existence of a sharp classification of causal information, in terms of the kind of questions that each class is capable of answering. The classification forms a 3-level hierarchy in the sense that questions at one level can only be answered if information from next levels is available."
- association P(y|x) - seeing (what is?)
  How would seeing X change my belief in Y?
  What does a symptom tell me about a disease?
- intervention P(y|do(x),z) - doing (what if?)
  What if I do X?
  What if I take aspirin, will my headache be cured?
  What if we ban cigarettes?
- counterfactuals P(y_x|x',y') - imagining, retrospection (why?)
  Was it X that caused Y?
  What if I had acted differently?
  Was it the aspirin that stopped my headache?
  What if I had not been smoking the past 2 years?
"The first level, Association, invokes purely statistical relationships, defined by the naked data. For instance, observing a customer who buys toothpaste makes it more likely that he/she buys floss; such association can be inferred directly from the observed data using conditional expectation. Questions at this layer, because they require no causal information, are placed at the bottom level on the hierarchy.
The second level, Intervention, ranks higher than Association because it involves not just seeing what is, but changing what we see. A typical question at this level would be: What happens if we double the price? Such questions cannot be answered from sales data alone, because they involve a change in customers' behavior in reaction to the new pricing. These choices may differ substantially from those taken in previous price-raising situations, unless we replicate precisely the market conditions that existed when the price reached double its current value.
The third level, Counterfactuals, is placed at the top of the hierarchy because they subsume interventional and associational questions. A typical question in the counterfactual category is “What if I had acted differently” thus necessitating retrospective reasoning.
If we have a model that can answer counterfactual queries, we can also answer questions about interventions and observations. For example, the interventional question “What will happen if we double the price?” can be answered by asking the counterfactual question: “What would happen had the price been twice its current value?” Likewise, associational questions can be answered once we can answer interventional questions; we simply ignore the action part and let observations take over.
The translation does not work in the opposite direction. Interventional questions cannot be answered from purely observational information (i.e., from statistical data alone). No counterfactual question involving retrospection can be answered from purely interventional information, such as that acquired from controlled experiments; we cannot re-run an experiment on subjects who were treated with a drug and see how they would behave had they not been given the drug."
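The gap between the first two levels can be made concrete with a small simulation. The sketch below uses a hypothetical binary model (Z causes both X and Y, and X causes Y); the variable names and all probabilities are assumptions chosen for illustration, not taken from the quoted text. Conditioning on X ("seeing") mixes in the effect of Z, while setting X by hand ("doing") removes the arrow from Z to X.

```python
import numpy as np

def simulate(n, rng, do_x=None):
    """Sample from a toy SCM with Z -> X, Z -> Y and X -> Y (all binary)."""
    z = rng.random(n) < 0.5
    if do_x is None:
        # Observational regime: X listens to its cause Z.
        x = rng.random(n) < np.where(z, 0.8, 0.2)
    else:
        # Interventional regime do(X = do_x): the Z -> X mechanism is replaced.
        x = np.full(n, bool(do_x))
    y = rng.random(n) < 0.1 + 0.2 * x + 0.5 * z
    return z, x, y

rng = np.random.default_rng(0)
n = 200_000

# Level 1 (seeing): difference of conditional probabilities P(y | x).
_, x, y = simulate(n, rng)
seeing = y[x].mean() - y[~x].mean()

# Level 2 (doing): difference of P(y | do(x)) obtained by actually intervening.
_, _, y_do1 = simulate(n, rng, do_x=1)
_, _, y_do0 = simulate(n, rng, do_x=0)
doing = y_do1.mean() - y_do0.mean()

print(f"seeing: {seeing:.3f}  doing: {doing:.3f}")
```

In this model the observational contrast is about 0.5 while the interventional contrast is about 0.2, because seeing X = 1 also makes Z = 1 more likely.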
tuple (d1, d2, d3, d4) - (population, observational/experimental, sampling, measure)
(Los Angeles, experimental with randomized Z1, selection on Age, (X1, Z1, W, M, Y1))
(New York, observational, selection on SES, (X1, X2, Z1, N, Y2))
(Texas, experimental with randomized Z2, (X2, Z1, W, L, M, Y1))
statistics - descriptive:
(d1, samples(observations), d3, d4) -> (d1, distribution(observations), d3, d4) (Bernoulli, Poisson, Kolmogorov)
statistics - experimental:
(d1, samples(do(X)), d3, d4) -> (d1, distribution(do(X)), d3, d4) (Fisher, Cox, Goodman)
causal inference from observational studies:
(d1, distribution(observations), d3, d4) -> (d1, distribution(do(X)), d3, d4) (Rubin, Robins, Dawid, Pearl)
experimental inference (generalized instrumental variables):
(d1, distribution(do(Z)), d3, d4) -> (d1, distribution(do(X)), d3, d4) (P. Wright, S. Wright)
sampling selection bias:
(d1, d2, select(Age), d4) -> (d1, d2, {}, d4) (Heckman)
transportability (external validity):
(bonobos, d2, d3, d4) -> (humans, d2, d3, d4) (Shadish, Cook, Campbell)
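As one illustration of the "distribution(observations) -> distribution(do(X))" step, here is a minimal sketch of backdoor adjustment, P(y | do(x)) = sum_z P(y | x, z) P(z). It assumes that a single observed discrete variable Z blocks all backdoor paths and that every (x, z) cell is non-empty; the data-generating model and the function name `backdoor_adjust` are hypothetical.

```python
import numpy as np

def backdoor_adjust(x, y, z, x_value):
    """Estimate E[Y | do(X = x_value)] as sum_z E[Y | X = x_value, Z = z] * P(Z = z)."""
    total = 0.0
    for z_value in np.unique(z):
        in_stratum = z == z_value
        cell = in_stratum & (x == x_value)
        # Assumes positivity: each stratum contains units with X = x_value.
        total += y[cell].mean() * in_stratum.mean()
    return total

# Hypothetical observational data: Z confounds X and Y, true effect of X is 0.3.
rng = np.random.default_rng(1)
n = 200_000
z = rng.integers(0, 2, n)
x = (rng.random(n) < np.where(z == 1, 0.7, 0.3)).astype(int)
y = (rng.random(n) < 0.2 + 0.3 * x + 0.4 * z).astype(int)

naive = y[x == 1].mean() - y[x == 0].mean()
adjusted = backdoor_adjust(x, y, z, 1) - backdoor_adjust(x, y, z, 0)
print(f"naive: {naive:.3f}  adjusted: {adjusted:.3f}")
```

The naive contrast is biased upward by the confounder, while the adjusted estimate recovers the effect built into the simulation. Whether such an adjustment set exists in a real problem is a modelling assumption that the data alone cannot confirm.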
"Under probabilistic interpretation of causation from Pearl, the causal structure underlying a set of random variables X=(X1, ..., Xd), with joint distribution P, is often described in terms of a Directed Acyclic Graph, denoted by G = (V, E). In this graph, each vertex Vi ∈ V is associated to the random variable Xi ∈ X, and an edge Eji ∈ E from Vj to Vi denotes the causal relationship “Xi ← Xj”. More specifically, these causal relationships are defined by a structural equation model: each Xi ← fi(Pa(Xi), Ni), where fi is a function, Pa(Xi) is the parental set of Vi ∈ V, and Ni is some independent noise variable. Then, causal inference is the task of recovering G from S ∼ P^n."
"Causal graph and the intervention types and targets may be (partially) unknown. This is a realistic setting in many practical applications. For example, in biology, many interventions that can be performed on organisms are known to result in measurable downstream effects, but the exact mechanism and direct intervention targets are unknown, and therefore it is not clear whether the knowledge gained may be transferred to other species. In pharmaceutical research, it is desirable to target the root causes of illness directly and minimize side-effects; however, as the causal mechanisms are often poorly understood, it is unclear what exactly a drug is doing and whether the results of a particular study on a subpopulation of patients (say, middle-aged males in the US) will generalize to other subpopulations (e.g., elderly women with dementia). In policy decisions, changing tax rules may have different repercussions for different socio-economic classes, but the exact workings of an economy can only be modeled to a certain extent. Machine learning may help to make such predictions more data-driven, but should then correctly take into account the transfer of distributions that result from interventions and context changes. For prediction in IID setting, imitating the exterior of a process is enough (i.e. can disregard causal structure). Anything else can benefit from causal learning."
"The dramatic success in machine learning has led to an explosion of artificial intelligence applications and increasing expectations for autonomous systems that exhibit human-level intelligence. These expectations have, however, met with fundamental obstacles that cut across many application areas. One such obstacle is adaptability, or robustness. Machine learning researchers have noted current systems lack the ability to recognize or react to new circumstances they have not been specifically programmed or trained for."
video
https://youtube.com/watch?v=nWaM6XmQEmU (Pearl)
"Causality for Machine Learning" Scholkopf
"Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence, and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them."
"Causal Inference and the Data-fusion Problem" Bareinboim, Pearl
"We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion - piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks."
video
https://youtube.com/watch?v=_cNbWuErsoI (Bareinboim)
video
https://youtube.com/watch?v=dUsokjG4DHc (Bareinboim)
"On Causal and Anticausal Learning" Schoelkopf et al.
ICML 2012
"We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results."
video
https://youtu.be/zo4oRqfMrgo?t=15m58s (Lipton)
"Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising" Bottou et al.
"This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine."
video
https://youtube.com/watch?v=qmQceWeYg04 (Bottou)
video
https://youtube.com/watch?v=W8k5KqYqVBw (Bottou)
video
https://youtube.com/watch?v=isGAY9ELqyo (Bottou)
video
https://youtu.be/_RtxTpOb8e4?t=52m6s (Huszar)
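A common way to predict the consequences of a system change from logged data is importance (inverse propensity) weighting: reweight each logged outcome by the ratio of the new policy's probability to the logging policy's probability of the action that was taken. The sketch below is a generic toy version with three hypothetical ad slots and made-up click rates; it is not the Bing pipeline, and it assumes the logging probabilities are known and non-zero wherever the target policy puts mass.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Importance-weighted estimate of the expected reward under the target policy."""
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(2)
n = 200_000

click_rate = np.array([0.05, 0.10, 0.20])     # hypothetical true reward per action
logging_policy = np.array([0.6, 0.3, 0.1])    # policy that produced the logs
target_policy = np.array([0.1, 0.3, 0.6])     # policy we would like to evaluate

# Logged interactions collected under the logging policy only.
actions = rng.choice(3, size=n, p=logging_policy)
rewards = (rng.random(n) < click_rate[actions]).astype(float)

estimate = ips_estimate(rewards, logging_policy[actions], target_policy[actions])
logged_average = float(rewards.mean())
truth = float(target_policy @ click_rate)     # known only because this is a simulation
print(f"logged: {logged_average:.3f}  IPS: {estimate:.3f}  truth: {truth:.3f}")
```

The variance of the estimate grows with the size of the weights, so a target policy far from the logging policy needs much more data or some form of weight clipping.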
"Causal Bootstrapping" Little, Badawy
"To draw scientifically meaningful conclusions and build reliable engineering models of quantitative phenomena, statistical models must take cause and effect into consideration (either implicitly or explicitly). This is particularly challenging when the relevant measurements are not obtained from controlled experimental (interventional) settings, so that cause and effect can be obscured by spurious, indirect influences. Modern predictive techniques from machine learning are capable of capturing high-dimensional, complex, nonlinear relationships between variables while relying on few parametric or probabilistic modelling assumptions. However, since these techniques are associational, applied to observational data they are prone to picking up spurious influences from non-experimental (observational) data, making their predictions unreliable. Techniques from causal inference, such as probabilistic causal diagrams and do-calculus, provide powerful (nonparametric) tools for drawing causal inferences from such observational data. However, these techniques are often incompatible with modern, nonparametric machine learning algorithms since they typically require explicit probabilistic models. Here, we develop causal bootstrapping, a set of techniques for augmenting classical nonparametric bootstrap resampling with information about the causal relationship between variables. This makes it possible to resample observational data such that, if it is possible to identify an interventional relationship from that data, new data representing that relationship can be simulated from the original observational data. In this way, we can use modern machine learning algorithms unaltered to make statistically powerful, yet causally-robust, predictions. 
We develop several causal bootstrapping algorithms for drawing interventional inferences from observational data, for classification and regression problems, and demonstrate, using synthetic and real-world examples, the value of this approach."
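The idea of resampling observational data so that it looks interventional can be illustrated with a much simplified stand-in: an inverse-probability-weighted bootstrap for a single discrete confounder. This is not one of the paper's algorithms; the function name, the weights 1 / P(x | z) and the toy model are assumptions chosen to show the principle. After resampling, X is approximately independent of Z, so any off-the-shelf learner trained on the resampled rows sees an approximately deconfounded dataset.

```python
import numpy as np

def weighted_bootstrap_indices(x, z, rng):
    """Draw bootstrap indices with probability proportional to 1 / P(x_i | z_i)."""
    weights = np.empty(len(x))
    for z_value in np.unique(z):
        in_stratum = z == z_value
        for x_value in np.unique(x[in_stratum]):
            cell = in_stratum & (x == x_value)
            # Empirical 1 / P(x | z) for every row in this (x, z) cell.
            weights[cell] = in_stratum.sum() / cell.sum()
    probs = weights / weights.sum()
    return rng.choice(len(x), size=len(x), replace=True, p=probs)

# Hypothetical confounded data: true effect of X on Y is 0.3.
rng = np.random.default_rng(4)
n = 200_000
z = rng.integers(0, 2, n)
x = (rng.random(n) < np.where(z == 1, 0.7, 0.3)).astype(int)
y = (rng.random(n) < 0.2 + 0.3 * x + 0.4 * z).astype(int)

idx = weighted_bootstrap_indices(x, z, rng)
x_res, y_res, z_res = x[idx], y[idx], z[idx]

naive = y[x == 1].mean() - y[x == 0].mean()
resampled = y_res[x_res == 1].mean() - y_res[x_res == 0].mean()
print(f"naive: {naive:.3f}  after causal resampling: {resampled:.3f}")
```

On the resampled rows a plain difference of means is close to the simulated effect, whereas the same statistic on the original rows is inflated by the confounder.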
"Discovering Causal Signals in Images" Lopez-Paz, Nishihara, Chintala, Scholkopf, Bottou
"This paper establishes the existence of observable footprints that reveal the "causal dispositions" of the object categories appearing in collections of images. We achieve this goal in two steps. First, we take a learning approach to observational causal discovery, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, given samples from their joint distribution. Second, we use our causal direction classifier to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of a relation between the direction of causality and the difference between objects and their contexts, and by the same token, the existence of observable signals that reveal the causal dispositions of objects."
"First, we take a learning approach to observational causal inference, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, when given samples from their joint distribution. Second, we use our causal direction finder to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of (1) a relation between the direction of causality and the difference between objects and their contexts, and (2) observable causal signals in collections of static images."
"Causal features are those that cause the presence of the object of interest in the image (that is, those features that cause the object’s class label), while anticausal features are those caused by the presence of the object in the image (that is, those features caused by the class label)."
"Paper aims to verify experimentally that the higher-order statistics of image datasets can inform about causal relations. Authors conjecture that object features and anticausal features are closely related and vice-versa context features and causal features are not necessarily related. Context features give the background while object features are what it would be usually inside bounding boxes in an image dataset."
"Better algorithms for causal direction should, in principle, help learning features that generalize better when the data distribution changes. Causality should help with building more robust features by awareness of the generating process of the data."
video
https://youtube.com/watch?v=DfJeaa--xO0 (Bottou)
post
http://giorgiopatrini.org/posts/2017/09/06/in-search-of-the-missing-signals/
notes
http://www.shortscience.org/paper?bibtexKey=journals/corr/Lopez-PazNCSB16
"Learning Representations for Counterfactual Inference" Johansson, Shalit, Sontag
"Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art."
"In this paper we focus on counterfactual inference, which is a widely applicable special case of causal inference. We cast counterfactual inference as a type of domain adaptation problem, and derive a novel way of learning representations suited for this problem. Our models rely on a novel type of regularization criteria: learning balanced representations, representations which have similar distributions among the treated and untreated populations. We show that trading off a balancing criterion with standard data fitting and regularization terms is both practically and theoretically prudent. Open questions which remain are how to generalize this method for cases where more than one treatment is in question, deriving better optimization algorithms and using richer discrepancy measures."
video
http://techtalks.tv/talks/learning-representations-for-counterfactual-inference/62489/ (Johansson)
video
https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Deep-Learning-Symposium-Session-3 (Shalit)
notes
http://www.shortscience.org/paper?bibtexKey=journals/corr/JohanssonSS16
code
https://github.com/clinicalml/cfrnet
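The trade-off between data fit and balance described above can be sketched as a loss with two terms. The version below measures imbalance as the squared distance between the mean representations of treated and control units, which is a simple stand-in and not necessarily the discrepancy used in the paper; `balance_penalty`, `objective`, `alpha` and the toy data are all hypothetical.

```python
import numpy as np

def balance_penalty(phi, t):
    """Squared distance between mean representations of treated and control units."""
    gap = phi[t == 1].mean(axis=0) - phi[t == 0].mean(axis=0)
    return float(gap @ gap)

def objective(phi, t, y, y_pred, alpha=1.0):
    """Factual squared error plus a weighted imbalance term."""
    factual_loss = float(np.mean((y - y_pred) ** 2))
    return factual_loss + alpha * balance_penalty(phi, t)

# Toy data: treatment assignment depends strongly on the first covariate only.
rng = np.random.default_rng(6)
n = 5_000
covariates = rng.normal(size=(n, 2))
propensity = 1.0 / (1.0 + np.exp(-2.0 * covariates[:, 0]))
t = (rng.random(n) < propensity).astype(int)
y = covariates[:, 1] + t + rng.normal(scale=0.1, size=n)

phi_raw = covariates             # identity representation keeps the imbalance
phi_balanced = covariates[:, 1:] # dropping the first covariate removes most of it

penalty_raw = balance_penalty(phi_raw, t)
penalty_balanced = balance_penalty(phi_balanced, t)
loss = objective(phi_balanced, t, y, y_pred=covariates[:, 1] + t, alpha=1.0)
print(f"raw: {penalty_raw:.3f}  balanced: {penalty_balanced:.4f}  loss: {loss:.4f}")
```

In a real model the representation would be learned jointly with the outcome predictor, and discarding a covariate is only safe when it does not also drive the outcome; that tension is what the weight `alpha` is meant to control.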
"Causal Effect Inference with Deep Latent-Variable Models" Louizos, Shalit, Mooij, Sontag, Zemel, Welling
"Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers. The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurement of proxies for confounders. We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects."
"Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and with scalable algorithms for their Bayesian inference. However, there has been limited progress in models that capture causal relationships, for example, how individual genetic factors cause major human diseases. In this work, we focus on two challenges in particular: How do we build richer causal models, which can capture highly nonlinear relationships and interactions between multiple causes? How do we adjust for latent confounders, which are variables influencing both cause and effect and which prevent learning of causal relationships? To address these challenges, we synthesize ideas from causality and modern probabilistic modeling. For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density. For the second, we describe an implicit causal model that adjusts for confounders by sharing strength across examples. In experiments, we scale Bayesian inference on up to a billion genetic measurements. We achieve state of the art accuracy for identifying causal factors: we significantly outperform existing genetics methods by an absolute difference of 15-45.3%."
video
https://vimeo.com/253922904 (Tran)
video
https://youtube.com/watch?v=gi2jZ_bVJuA (Tran)
slides
http://dustintran.com/talks/Tran_Genomics.pdf
post
https://www.alexdamour.com/blog/public/2018/05/18/non-identification-in-latent-confounder-models
"Learning Functional Causal Models with Generative Neural Networks" Goudet, Kalainathan, Caillou, Lopez-Paz, Guyon, Sebag, Tritas, Tubaro
CGNN
"We introduce a new approach to functional causal modeling from observational data. The approach, called Causal Generative Neural Networks, leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables, by minimizing the Maximum Mean Discrepancy between generated and observed data. An approximate learning criterion is proposed to scale the computational cost of the approach to linear complexity in the number of observations. The performance of CGNN is studied throughout three experiments. First, we apply CGNN to the problem of cause-effect inference, where two CGNNs model P(Y|X,noise) and P(X|Y,noise) identify the best causal hypothesis out of X → Y and Y → X. Second, CGNN is applied to the problem of identifying v-structures and conditional independences. Third, we apply CGNN to problem of multivariate functional causal modeling: given a skeleton describing the dependences in a set of random variables {X1,…,Xd}, CGNN orients the edges in the skeleton to uncover the directed acyclic causal graph describing the causal structure of the random variables. On all three tasks, CGNN is extensively assessed on both artificial and real-world data, comparing favorably to the state-of-the-art. Finally, we extend CGNN to handle the case of confounders, where latent variables are involved in the overall causal model."
video
https://vimeo.com/252105914#t=37m10s (Goudet)
code
https://github.com/GoudetOlivier/CGNN
paper
"Causal Generative Neural Networks" by Goudet et al.
"SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning" Kalainathan, Goudet, Guyon, Lopez-Paz, Sebag
"We present the Structural Agnostic Model, a framework to estimate end-to-end non-acyclic causal graphs from observational data. In a nutshell, SAM implements an adversarial game in which a separate model generates each variable, given real values from all others. In tandem, a discriminator attempts to distinguish between the joint distributions of real and generated samples. Finally, a sparsity penalty forces each generator to consider only a small subset of the variables, yielding a sparse causal graph. SAM scales easily to hundreds variables. Our experiments show the state-of-the-art performance of SAM on discovering causal structures and modeling interventions, in both acyclic and non-acyclic graphs."
"Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search" Buesing, Weber, Zwols, Racaniere, Guez, Lespiau, Heess
CF-GPS
counterfactual inference
ICLR 2019
"Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods."
"Instead of relying on data synthesized from scratch by a model, we train policies on model predictions of alternate outcomes of past experience from the true environment under counterfactual actions, i.e. actions that had not actually been taken, while everything else remaining the same. At the heart of CF-GPS are structural causal models which model the environment with two ingredients: 1) Independent random variables, called scenarios here, summarize all aspects of the environment that cannot be influenced by the agent. 2) Deterministic transition functions (also called causal mechanisms) take these scenarios, together with the agent’s actions, as input and produce the predicted outcome. The central idea of CF-GPS is that, instead of running an agent on scenarios sampled de novo from a model, we infer scenarios in hindsight from given off-policy data, and then evaluate and improve the agent on these specific scenarios using given or learned causal mechanisms."
"We show that CF-GPS generalizes and empirically improves on a vanilla model-based RL algorithm, by mitigating model mismatch via “grounding” or “anchoring” model-based predictions in inferred scenarios. As a result, this approach explicitly allows to trade-off historical data for model bias. CF-GPS differs substantially from standard off-policy RL algorithms based on Importance Sampling, where historical data is re-weighted with respect to the importance weights to evaluate or learn new policies. In contrast, CF-GPS explicitly reasons counterfactually about given off-policy data."
"We formulate model-based RL in POMDPs in terms of structural causal models, thereby connecting concepts from reinforcement learning and causal inference."
"We provide the first results, to the best of our knowledge, showing that counterfactual reasoning in structural causal models on off-policy data can facilitate solving non-trivial RL tasks."
"We show that two previously proposed classes of RL algorithms, namely Guided Policy Search and Stochastic Value Gradient methods can be interpreted as counterfactual methods, opening up possible generalizations."
"Simulating plausible synthetic experience de novo is a hard problem for many environments, often resulting in biases for model-based RL algorithms. The main takeaway from this work is that we can improve policy learning by evaluating counterfactual actions in concrete, past scenarios. Compared to only considering synthetic scenarios, this procedure mitigates model bias."
"We assumed that there are no additional hidden confounders in the environment and that the main challenge in modelling the environment is capturing the distribution of the noise sources p(U), whereas we assumed that the transition and reward kernels given the noise is easy to model. This seems a reasonable assumption in some environments, such as the partially observed grid-world considered here, but not all. Probably the most restrictive assumption is that we require the inference over the noise U given data hT to be sufficiently accurate. We showed in our example, that we could learn a parametric model of this distribution from privileged information, i.e. from joint samples u, hT from the true environment. However, imperfect inference over the scenario U could result e.g. in wrongly attributing a negative outcome to the agent’s actions, instead environment factors. This could in turn result in too optimistic predictions for counterfactual actions. Future research is needed to investigate if learning a sufficiently strong SCM is possible without privileged information for interesting RL domains. If, however, we can trust the transition and reward kernels of the model, we can substantially improve model-based RL methods by counterfactual reasoning on off-policy data, as demonstrated in our experiments and by the success of Guided Policy Search and Stochastic Value Gradient methods."
"The proposed approach here is general but only instantiated (in terms of inference algorithms and experiments) for when the initial starting state is unknown in a deterministic POMDP environment, where the dynamics and reward model is known. The authors show that they can use inference over the full trajectory (or some multi-time-step subpart) to get a (often delta function) posterior over the initial starting state, which then allows them to build a more accurate initial state distribution for use in their model simulations than approaches that do not use more than 1 step to do so. This is interesting, but it’s not quite clear where this sort of situation would arise in practice, and the proposed experimental results are limited to one simulated toy domain."
"Causal Reasoning from Meta-reinforcement Learning" Dasgupta et al.
"Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions. Although established formal causal reasoning algorithms also exist, in this paper we show that such reasoning can arise from model-free reinforcement learning, and suggest that causal reasoning in complex settings may benefit from the more end-to-end learning-based approaches presented here. This work also offers new strategies for structured exploration in reinforcement learning, by providing agents with the ability to perform - and interpret - experiments."
"Agents trained in this manner performed causal reasoning in three data settings: observational, interventional, and counterfactual. Our approach did not require explicit encoding of formal principles of causal inference. Rather, by optimizing an agent to perform a task that depended on causal structure, the agent learned implicit strategies to generate and use different kinds of available data for causal reasoning, including drawing causal inferences from passive observation, actively intervening, and making counterfactual predictions, all on held out causal CBNs that the agents had never previously seen. A consistent result in all three data settings was that our agents learned to perform good experiment design or active learning. That is, they learned a non-random data collection policy where they actively chose which nodes to intervene (or condition) on in the information phase, and thus could control the kinds of data they saw, leading to higher performance in the quiz phase than that from an agent with a random data collection policy."
"We showed that agents learned to perform do-calculus. We saw that, the trained agent with access to only observational data received more reward than the highest possible reward achievable without causal knowledge. We further observed that this performance increase occurred selectively in cases where do-calculus made a prediction distinguishable from the predictions based on correlations – i.e. where the externally intervened node had a parent, meaning that the intervention resulted in a different graph."
"We showed that agents learned to resolve unobserved confounders using interventions (which is impossible with only observational data). We saw that agents with access to interventional data performed better than agents with access to only observational data only in cases where the intervened node shared an unobserved parent (a confounder) with other variables in the graph."
"We showed that agents learned to use counterfactuals. We saw that agents with additional access to the specific randomness in the test phase performed better than agents with access to only interventional data. We found that the increased performance was observed only in cases where the maximum mean value in the graph was degenerate, and optimal choice was affected by the latent randomness – i.e. where multiple nodes had the same value on average and the specific randomness could be used to distinguish their actual values in that specific case."
"General Identifiability with Arbitrary Surrogate Experiments" Lee, Correa, Bareinboim
UAI 2019
"We study the problem of causal identification from an arbitrary collection of observational and experimental distributions, and substantive knowledge about the phenomenon under investigation, which usually comes in the form of a causal graph. We call this problem g-identifiability, or gID for short. The gID setting encompasses two well-known problems in causal inference, namely, identifiability and z-identifiability — the former assumes that an observational distribution is necessarily available, and no experiments can be performed, conditions that are both relaxed in the gID setting; the latter assumes that all combinations of experiments are available, i.e., the power set of the experimental set Z, which gID does not require a priori. In this paper, we introduce a general strategy to prove non-gID based on hedgelets and thickets, which leads to a necessary and sufficient graphical condition for the corresponding decision problem. We further develop a procedure for systematically computing the target effect, and prove that it is sound and complete for gID instances. In other words, failure of the algorithm in returning an expression implies that the target effect is not computable from the available distributions. Finally, as a corollary of these results, we show that do-calculus is complete for the task of g-identifiability."
"In one line of investigation, this task is formalized through the question of whether the effect that an intervention on a set of variables X will have on another set of outcome variables Y (denoted Px(y)) can be uniquely computed from the probability distribution P over the observed variables V and a causal diagram G. This is known as the problem of identification, and has received great attention in the literature, starting with a number of sufficient conditions, and culminating in a complete graphical and algorithmic characterization. Despite the generality of such results, it’s the case that in some real-world applications the quantity Px(y) is not identifiable (i.e., not uniquely computable) from the observational data and the causal diagram."
"On an alternative thread in the literature, causal effects (Px(y)) are obtained directly through controlled experimentation. In the biomedical sciences, for instance, considerable resources are spent every year by the FDA, the NIH, and others, in supporting large-scale, systematic, and controlled experimentation, which comes under the rubric of Randomized Controlled Trials. The same method is also leveraged in the context of reinforcement learning, for example, when an autonomous agent is deployed in an environment and is given the capability of performing interventions and observing how they unfold in time. Through this process, experimental data is gathered, and used in the construction of a strategy, also known as policy, with the goal of optimizing the agent’s cumulative reward (e.g., survival, profitability, happiness). Despite all the inferential power entailed by this approach, there are real-world settings where controlling the variables in X is not feasible, possibly due to economical, technical, or ethical constraints."
"In this paper, we note that these two approaches can be seen as extremes in a spectrum of possible research designs, which can be combined to solve very natural, albeit non-trivial, causal inference problems. In fact, this generalized setting has been investigated in the literature under the rubric of z-identifiability (zID, for short). Formally, zID asks whether Px(y) can be uniquely computed from the combination of the observational distribution P(V) and the experimental distributions Pz'(V), for all Z'⊆ Z for some Z ⊆ V."