Automatic integration and probabilistic programming in the spirit of Church and Anglican.
- Is there a way to compile PGMs to probabilistic circuits?
- Is there an algebra that unifies [back/belief/expectation/equilibrium]-prop on PGMs/PCs?
- Is there a family of functions which is closed over differentiation and integration?
- Is there a tractable inversion sampling procedure for higher dimensional quantiles?
- Is there a way to perform inference on Bayesian networks using backprop?
- Is there a formula for propagating uncertainty through elementary functions?
The way we teach probability in school is too abstract. First we learn about probability distributions. Where did these distributions come from? How did they arise? To gain a better understanding of probabilistic reasoning, we should start from first principles with an application firmly in mind, and build up our intuition through simple programming exercises.
Why does probability exist? Probability exists because we have imperfect information about the world. Whether due to inaccessibility, poor visibility, measurement error or other sources, our sensing instruments are only capable of gathering a blurry representation of their environment. We are thus permanently stranded in a state of uncertainty about the world around us. Nevertheless, we would like to reconstruct and reason about possible realities to make informed decisions.
Most computers, by contrast, are deterministic and fully observable. How could we use a deterministic machine to simulate a stochastic one? We must first have a source of noise. This may come from the outside world through a special input device called a random number generator (RNG), or, lacking one, from an artificial or pseudo-RNG (PRNG). A good PRNG is indistinguishable from a true RNG to any observer who does not know its internal state, and its internal state cannot be easily guessed by observing its outputs. Many suitable candidates have been proposed.
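To make this concrete, here is a minimal sketch of a PRNG, using a linear congruential generator with commonly published constants (the class name and parameters are illustrative, not a recommendation for serious use):

```python
# A toy PRNG: a linear congruential generator x_{k+1} = (a*x_k + c) mod m.
# The stream is fully determined by the seed, but looks random to an observer
# who does not know the internal state.
class LCG:
    def __init__(self, seed: int = 42):
        self.state = seed

    def next_int(self) -> int:
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state

    def next_float(self) -> float:
        # Scale into the unit interval [0, 1).
        return self.next_int() / 2**32

rng = LCG(seed=2023)
print([round(rng.next_float(), 3) for _ in range(5)])
```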
What is a probability distribution? A probability distribution is a set of elements, accompanied by the relative frequency with which they occur. Suppose we encounter a foreign language which we do not understand. How could we learn it? First, we might read some text and gather various statistics, such as the alphabet size and how many times each symbol occurs. We could summarize this information in a data structure called a histogram. There are many useful algorithms for computing these summaries efficiently, either exactly or approximately.
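For instance, a symbol histogram over a small corpus takes only a few lines (the toy corpus below is illustrative):

```python
from collections import Counter

text = "abcbbbbccba"                 # a toy "foreign language" corpus
histogram = Counter(text)            # symbol -> count, e.g. {'b': 6, 'c': 3, 'a': 2}
total = sum(histogram.values())
pmf = {s: n / total for s, n in sorted(histogram.items())}   # relative frequencies
print(pmf)
```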
Now, suppose we want to sample from our probabilistic model. To do so, we could take our histogram and, for each symbol, compute a running sum up to its histogram index, called a cumulative distribution function (CDF). We draw a sample from our PRNG to get a number uniformly between 0 and 1, then select the first symbol whose CDF value exceeds that number, and voilà! we have obtained a sample.
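A minimal sketch of this inverse transform sampling procedure, reusing the histogram above:

```python
import random
from collections import Counter
from itertools import accumulate

def sample(histogram, rng=random):
    """Draw one symbol: pick the first symbol whose CDF value exceeds u."""
    symbols = sorted(histogram)
    total = sum(histogram.values())
    cdf = list(accumulate(histogram[s] / total for s in symbols))
    u = rng.random()                               # uniform in [0, 1)
    # Fall back to the last symbol to guard against floating-point round-off.
    return next((s for s, c in zip(symbols, cdf) if u < c), symbols[-1])

histogram = Counter("abcbbbbccba")
print("".join(sample(histogram) for _ in range(20)))
```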
In order to sample longer sequences, we might want to incorporate some context, such as pairs of adjacent characters. To do so, we could build a two-dimensional histogram, sample the first symbol from the "marginal" distribution P(T₁=t₁), and the second from the "conditional" distribution P(T₂=t₂|T₁=t₁), the probability of the second character being t₂ given the preceding character was t₁. This data structure is called a Markov or transition matrix.
P(T₁=t₁, T₂=t₂) = P(T₂=t₂|T₁=t₁)P(T₁=t₁)
String:  abcbbbbccba…
Index:   1   2   3   4   5   6   7   8   9   10  …
Window:  ab, bc, cb, bb, bb, bb, bc, cc, cb, ba, …
Counts:  1 , 2 , 2 , 3 , 3 , 3 , 2 , 1 , 2 , 1 , …
Transition matrix at index=10:
       a  b  c  …
     ┌─────────────
   a │ 0  1  0
   b │ 1  3  2
   c │ 0  2  1
   ⋮              / 10
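Under the same toy corpus, the bigram model might be sketched as follows (the helper names are illustrative):

```python
import random
from collections import Counter, defaultdict

text = "abcbbbbccba"
joint = Counter(zip(text, text[1:]))            # counts for P(T1=t1, T2=t2)
conditional = defaultdict(Counter)              # counts for P(T2=t2 | T1=t1)
for (t1, t2), n in joint.items():
    conditional[t1][t2] = n

def draw(counter):
    symbols, weights = zip(*counter.items())
    return random.choices(symbols, weights=weights)[0]

def generate(length):
    seq = [draw(Counter(text[:-1]))]            # first symbol ~ marginal P(T1)
    for _ in range(length - 1):
        seq.append(draw(conditional[seq[-1]]))  # next symbol ~ P(T_k | T_{k-1})
    return "".join(seq)

print(generate(20))
```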
More generally, we might have longer windows containing triples or n-tuples of contiguous symbols. To represent longer contexts, we could record their probabilities in a multidimensional array or transition tensor, representing the probability of a subsequence t₁t₂…tₙ. This tensor is a probability distribution whose conditionals "slice" or disintegrate it along a dimension, producing an (n-1)-dimensional slice, the conditional probability of observing a given symbol in a given slot:
P(T₁=t₁, T₂=t₂, …, Tₙ=tₙ) = P(Tₙ=tₙ | Tₙ₋₁=tₙ₋₁, Tₙ₋₂=tₙ₋₂, …, T₁=t₁) P(Tₙ₋₁=tₙ₋₁, …, T₁=t₁),
where the tensor rank n is given by the context length, T₁…ₙ are random variables and t₁…ₙ are their concrete instantiations. This tensor is a hypercube with shape |alphabet|ⁿ: each entry identifies a unique subsequence of n symbols and the probability of observing them in the same context. Note the exponential state space of this model: as n grows larger, it will quickly require a very large amount of space to represent.
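A dense n-gram tensor might be sketched as follows (assuming NumPy is available; note how the cell count grows as |alphabet|ⁿ):

```python
import numpy as np

text = "abcbbbbccba"
alphabet = sorted(set(text))
index = {s: i for i, s in enumerate(alphabet)}
n = 3                                            # context length

tensor = np.zeros((len(alphabet),) * n)          # shape |alphabet|^n
for i in range(len(text) - n + 1):
    tensor[tuple(index[s] for s in text[i:i + n])] += 1
tensor /= tensor.sum()                           # joint distribution P(T1, ..., Tn)

# Conditioning slices the tensor: P(T3 | T1='b', T2='b') is a 1-D fiber.
fiber = tensor[index['b'], index['b'], :]
print(fiber / fiber.sum())
print("cells:", tensor.size)                     # 3**3 here, |alphabet|**n in general
```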
The first improvement we could make is to sparsify the tensor, i.e., only record its nonzero entries in some list-based data structure or sparse dictionary. Suppose, in the previous example where n=2, we only stored the nonzero entries as a list of pairs mapping bigrams to their frequency. By doing so, we could reduce the space consumption by 1/3. We could further reduce the space by only storing duplicate frequencies once. This would improve the space consumption by a small factor for multimodal distributions.
        Matrix                     Sparse List                     Bidirectional Map
       a  b  c  …
     ┌─────────────                                                (ab,ba,cc) <-> 1
   a │ 0  1  0                [(ab,1),                             (bc,cb)    <-> 2
   b │ 1  3  2                 (ba,1),(bb,3),(bc,2),               (bb)       <-> 3
   c │ 0  2  1                 (cb,2),(cc,1)]                      else        -> 0
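Both representations might be sketched in a few lines (variable names are illustrative):

```python
from collections import Counter

text = "abcbbbbccba"
dense_cells = len(set(text)) ** 2                   # 9 cells in the dense matrix
sparse = dict(Counter(zip(text, text[1:])))         # only the 6 nonzero cells
print(sparse)        # {('a','b'): 1, ('b','c'): 2, ('c','b'): 2, ('b','b'): 3, ...}

# Bidirectional map: store each distinct frequency once, shared by its bigrams.
by_count = {}
for bigram, count in sparse.items():
    by_count.setdefault(count, set()).add(bigram)
print(by_count)      # {1: {('a','b'), ('c','c'), ('b','a')}, 2: {...}, 3: {('b','b')}}
```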
However, we can do even better! Since the prefixes b and c occur more than once, we could store the transition counts in a prefix tree of pairs, whose first entry records the prefix and whose second records its frequency. As before, we could compress this into a DAG by deduplicating leaves with equal frequencies. This might be depicted as follows:
              Prefix Tree                                   Prefix DAG

           (*,10)                                           (*,10)
  ┌───────────┼──────────────┐              ┌─────────────┼───────────┐
(a,1)       (b,6)          (c,3)            a             b           c
  │     ┌─────┼─────┐     ┌──┴──┐           │         ┌───┼───┐     ┌─┴─┐
(b,1) (a,1) (b,3) (c,2) (b,2) (c,1)         b         a   b   c     b   c
                                            │         │   │   │     │   │
                                            └────┬────┘   │   └──┬──┘   │
                                                 1        3      2      │
                                                 └──────────────────────┘
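A prefix tree of bigram counts might be sketched as a nested dictionary; a DAG-style compression would additionally intern equal-count leaves so each distinct frequency is stored once and shared by reference:

```python
text = "abcbbbbccba"
trie = {"count": len(text) - 1, "children": {}}            # root: (*, 10)
for t1, t2 in zip(text, text[1:]):
    node = trie["children"].setdefault(t1, {"count": 0, "children": {}})
    node["count"] += 1                                     # prefix count, e.g. (b, 6)
    leaf = node["children"].setdefault(t2, {"count": 0, "children": {}})
    leaf["count"] += 1                                     # bigram count, e.g. b->c: 2

print(trie["children"]["b"]["count"])                      # 6
print(trie["children"]["b"]["children"]["c"]["count"])     # 2
```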
Space complexity, however important, is only part of the picture. Often the limiting factor for a data structure is the maximum speedup attainable through parallelization. While concurrent tries and dictionaries are available, they are nontrivial to implement and have suboptimal scaling properties. A more easily scalable approach is to recursively decompose the data into many disjoint subsets, summarize each one, and recombine the summaries. By designing the summary carefully, this process can be made embarrassingly parallel.
Mathematically, the structure we are looking for is something called a monoid. If the summary of interest can be computed in any order, it is called a commutative monoid. Many summaries naturally exhibit this property: sum, min, max, top-k and various probability distributions. Summaries which can be decomposed and recombined in this fashion are embarrassingly parallelizable, as illustrated below.
 abcbbbbccba ─────────────────────────────────────────────────────────────┐
   ┌──┴─────────────────────────────────┐                                 │
abcbbb                               bbccba                               │

         (*,5)          +                 (*,5)             =              (*,10)
  ┌────────┼────────┐           ┌──────┴───────┐                  ┌───────────┼──────────────┐
(a,1)    (b,3)    (c,1)   +   (b,3)          (c,2)        =     (a,1)       (b,6)          (c,3)
  │     ┌──┴──┐     │         ┌─────┼─────┐     ┌──┴──┐           │     ┌─────┼─────┐     ┌──┴──┐
(b,1) (b,2) (c,1) (b,1) + (a,1) (b,1) (c,1) (b,1) (c,1)   =     (b,1) (a,1) (b,3) (c,2) (b,2) (c,1)
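A sketch of the merge above: bigram counters form a commutative monoid under addition, so shard summaries can be computed independently and recombined in any order (the one-symbol overlap at the split keeps the boundary bigram):

```python
from collections import Counter
from functools import reduce

def summarize(chunk):
    """Per-shard summary: bigram counts, a commutative monoid under +."""
    return Counter(zip(chunk, chunk[1:]))

def merge(a, b):
    return a + b                        # associative, commutative, identity Counter()

text = "abcbbbbccba"
shards = [text[:6], text[5:]]           # "abcbbb", "bbccba" (overlap by one symbol)
parallel = reduce(merge, map(summarize, shards), Counter())
assert parallel == summarize(text)      # same summary as the sequential pass
print(parallel)
```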
So far, we have considered exact methods. What if we didn't care about estimating the exact transition probability, but only approximating it? How could we achieve that? Perhaps by using a probabilistic data structure, we could reduce the complexity even further.
One approach would be to use an approximate counting algorithm, or sketch-based summary. Sketches are probabilistic data structures for approximately computing some statistic efficiently. Without going into the details, sketching algorithms are designed to smoothly trade off error bounds for space efficiency and can be used to compute a summary statistic over a very large number of items. Even with a very low error tolerance, we can often obtain a dramatic reduction in space complexity.
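One such sketch is the count-min sketch; a minimal sketch of the idea follows (hash salts and table sizes are illustrative; estimates never undercount and overshoot by a small additive error with high probability):

```python
import random

class CountMinSketch:
    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
text = "abcbbbbccba"
for bigram in zip(text, text[1:]):
    cms.add(bigram)
print(cms.estimate(("b", "b")))         # 3 (exact here; approximate at scale)
```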
What about sample complexity? In many cases, we are not constrained by space or time, but by samples. In many high-dimensional settings, even if we had an optimally efficient sparse encoding, obtaining a faithful approximation to the true distribution would require more data than we could plausibly obtain. How could we do better in terms of sample efficiency? We need two things: (1) inductive priors and (2) learnable parameters. This is where algebraic structures like groups, rings and their cousins come in handy.
If we squint a little, neural networks are a bit like mergeable summaries which deconstruct their inputs and recombine them in specific ways. For images, we have the special Euclidean group, SE(2). There are many other groups which are interesting to consider in various domains. By constructing our models with these invariants, we can recover latent structure with far, far fewer samples than would be required by a naive encoding scheme. For our purposes, we are particularly interested in semiring algebras, a specific kind of algebra that may be employed to compute many useful properties of graphs, such as their longest, shortest and widest paths.
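As a taste of the semiring view, the same generic matrix product computes shortest paths under the tropical (min, +) semiring and widest paths under (max, min); the sketch below uses an illustrative three-node graph:

```python
from functools import reduce

INF = float("inf")

def semiring_matmul(A, B, add, mul, zero):
    """Matrix 'product' where + and x are replaced by semiring operations."""
    n = len(A)
    return [[reduce(add, (mul(A[i][k], B[k][j]) for k in range(n)), zero)
             for j in range(n)] for i in range(n)]

# Weighted adjacency matrix (INF = no edge, 0 on the diagonal).
W = [[0,   3,   INF],
     [INF, 0,   1  ],
     [7,   INF, 0  ]]

# Tropical (min, +) semiring: repeated products yield all-pairs shortest paths.
shortest = W
for _ in range(len(W) - 1):
    shortest = semiring_matmul(shortest, W, add=min, mul=lambda x, y: x + y, zero=INF)
print(shortest[0][2])   # 4, via 0 -> 1 -> 2
# Swapping in (max, min) gives widest paths; (+, x) recovers ordinary sum-product.
```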
TODO.
Suppose we have two Gaussian distributions with known parameters and want to combine them somehow. How could we combine them to form a new distribution? We could simply average their densities, but this might not be a valid operation depending on the units. We could "mix" them by flipping a coin, but the mean of the mixture might not give the mean of the two datasets. Or we could multiply the PDFs: two Gaussian distributions, when multiplied together, form another Gaussian! This is a very nice property. Now we do not need to sample from the parents, but can discard them and sample directly from the child!
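A minimal sketch of this closure property: multiplying two Gaussian densities yields (up to normalization) a Gaussian whose precision is the sum of the precisions and whose mean is the precision-weighted average of the means (the class below is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Gaussian:
    mean: float
    var: float

    def __mul__(self, other):
        # Precisions add; the mean is the precision-weighted average of the means.
        precision = 1 / self.var + 1 / other.var
        mean = (self.mean / self.var + other.mean / other.var) / precision
        return Gaussian(mean, 1 / precision)

parent1 = Gaussian(mean=0.0, var=4.0)
parent2 = Gaussian(mean=2.0, var=1.0)
child = parent1 * parent2
print(child)                            # Gaussian(mean=1.6, var=0.8)
```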
- Stable distributions are closed under convolution and linear combination of their random variables.
- A distribution is called infinitely divisible if it can be expressed as the sum of an arbitrary number of IID RVs.
- Gaussian distributions form a monoid.
We can use these algebraic properties to significantly simplify certain mixture distributions.
See notebook for further implementation details.
- miniKanren as a Tool for Symbolic Computation in Python, Willard (2020)
- Symbolic Exact Inference for Discrete Probabilistic Programs, Holtzen et al. (2019)
- APPL: A Probability Programming Language, Glen et al. (2012)
- Symbolic Statistics with SymPy, Rocklin (2010)
- PSI: Exact Symbolic Inference for Probabilistic Programs, Gehr et al. (2016)
- λPSI: Exact Inference for Higher-Order Probabilistic Programs, Gehr et al. (2020)
- Simplifying Probabilistic Programs Using Computer Algebra, Carette and Shan (2016)
- Symbolic Maximum Likelihood Estimation with Mathematica, Rose and Smith (2001)
- A New Method for Efficient Symbolic Propagation in Discrete Bayesian Networks, Castillo and Gutierrez (1996)
- Theano: A Python framework for fast computation of mathematical expressions, Al-Rfou et al. (2016)
- A Class of Probability Distributions that is Closed with Respect to Addition as Well as Multiplication of Independent Random Variables, Bondesson (2015)
- The Harmonic Logarithms and the Binomial Formula, Roman (1993)
- A New Distribution on the Simplex with Auto-Encoding Applications, Stirn et al. (2019)
- Closed Form Integration of Artificial Neural Networks, Gottschling (1999)
- Families of infinitely divisible distributions closed under mixing and convolution, Keilson & Steutel (1972)
- Semiring Programming: A Declarative Framework for Generalized Sum Product Problems, Belle and De Raedt (2020)
- Algebraic Model Counting, Kimmig et al. (2012)
- The Generalized Distributive Law, Aji and McEliece (2000)
- A Logical Approach for Factoring Belief Networks, Darwiche (2002)
- Algebra of inference in graphical models revisited, Ravanbakhsh and Greiner (2014)
- Boolean Matrix Factorization and Noisy Completion via Message Passing, Ravanbakhsh et al. (2016)
- Message Passing and Combinatorial Optimization, Ravanbakhsh (2015)
- Bayesian Boolean Matrix Factorisation, Rukat et al. (2017)
- Methods and Applications of (max,+) Linear Algebra, Gaubert (2006)
- A New Algebra for the Treatment of Markov Models, Barfoot and D'Eleuterio (2003)
- Affine Algebraic Decision Diagrams and their Application to Structured Probabilistic Inference, Sanner and McAllester (2005)
- An algebraic model for the propagation of errors in matrix calculus, Tran (2019)
- Expectation propagation for t-Exponential family using q-algebra, Futami et al. (2017)
- A computational system for uncertainty propagation of measurement results, Mari (2009)
- Notes on the Use of Propagation of Error Formulas, Ku (1966)
- Calculating measurement uncertainty using automatic differentiation, Hall (2001)
- Uncertainty of Measurement: A Review of the Rules for Calculating Uncertainty Components through Functional Relationships, Farrance and Frenkel (2012)
- Propagating Uncertainty in Instrumentation Systems, Hall (2005)
- Object-oriented software for evaluating measurement uncertainty, Hall (2013)
- Propagation of uncertainty: Expressions of second and third order uncertainty with third and fourth moments, Mekid and Vaja (2008)
- Propagation of errors for matrix inversion, Lefebvre (2000)
- Error propagation in Runge-Kutta methods, Spijker (1996)
- Error propagation and algorithms, Clift and Murfet (2019)
- Propagating Covariance in Computer Vision, Haralick (1994)
- Propagation of Probabilities, Means, and Variances in Mixed Graphical Association Models, Lauritzen (1992)
- Uncertainty Propagation in Data Processing Systems, Manousakis et al. (2018)
- Propagation of Uncertainty through Mathematical Operations
- A Summary of Error Propagation
- An Introduction To Error Propagation, Arras (1998)
- Efficient Parallel Random Sampling – Vectorized, Cache-Efficient, and Online, Sanders et al. (2019)
- Augur: Data-Parallel Probabilistic Modeling, Tristan et al. (2014)
- A Differential Approach to Inference in Bayesian Networks, Darwiche (2000)
- Scaling Exact Inference for Discrete Probabilistic Programs, Holtzen et al. (2020)
- Parallel Weighted Model Counting with Tensor Networks, Dudek and Vardi (2020)
- Faster Algorithms for Max-Product Message-Passing, McAuley and Caetano (2011)
- Fast inverse transform sampling in one and two dimensions, Olver and Townsend (2013)
- Fast and accurate parallel quantile computation, Luu (2016)
- Fast Random Integer Generation in an Interval, Lemire (2018)
- Fast Evaluation of Transcendental Functions, Karatsuba (1991)
- Tensor Train Spectral Method for Learning of Hidden Markov Models, Kuznetsov & Oseledets (2019)
- Tensor Balancing on Statistical Manifold, Sugiyama et al. (2018)
- Dynamic Multidimensional Histograms, Thaper et al. (2002)
- Space-Efficient Online Computation of Quantile Summaries, Greenwald and Khanna (2001)
- Frugal Streaming for Estimating Quantiles: One (or two) memory suffices, Ma et al. (2014)
- Smooth estimates of multiple quantiles in dynamically varying data streams, Hammer and Yazidi (2019)
- Mergeable summaries, Agarwal et al. (2012)
- Optimal Quantile Approximation in Streams, Karnin et al. (2019)
- Fast Concurrent Data Sketches, Rinberg et al. (2020)
- Data-Independent Space Partitionings for Summaries, Cormode et al. (2021)
- Small Summaries for Big Data, Cormode & Yi (2021)
- On the Relationship between Sum-Product Networks and Bayesian Networks, Zhao et al. (2015) [supplementary material]
- Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models, Choi et al. (2020)
- Probabilistic Circuits: Representations, Inference, Learning and Theory, Vergari et al. (2020) [ECML-PKDD talk]
- Sum-product networks: A survey, París et al. (2020)
- Sum-Product Networks: A New Deep Architecture, Poon and Domingos (2012) [source code, user guide]
- Learning the Structure of Sum-Product Networks, Gens and Domingos (2013) [source code]
- Tractable Operations for Arithmetic Circuits of Probabilistic Models, Shen, Choi and Darwiche (2016)
- On Relaxing Determinism in Arithmetic Circuits, Choi and Darwiche (2017)
- Approximate Inference by Compilation to Arithmetic Circuits, Lowd and Domingos (2010)
- The Sum-Product Theorem: A Foundation for Learning Tractable Models, Friesen (2016)
- The Sum-Product Theorem and its Applications, Friesen (2016)
- Learning and Inference in Tractable Probabilistic Knowledge Bases, Niepert and Domingos (2015)
- Combining Sum-Product Network and Noisy-Or Model for Ontology Matching, Li (2015)
- On the Relationship between Sum-Product Networks and Bayesian Networks, Zhao et al. (2015) [slides]
- Two Reformulation Approaches to Maximum-A-Posteriori Inference in Sum-Product Networks, Maua et al. (2020)
- Sum-Product Graphical Models, Desana and Schnörr (2020)
- Foundations of Probabilistic Programming
- An Introduction to Probabilistic Programming, van de Meent et al. (2018)
- Exact Bayesian Inference by Symbolic Disintegration, Shan and Ramsey (2017)
- Apache DataSketches: A library of stochastic streaming algorithms, Rhodes et al. (2021) [source code]
- DDSketch: A Fast and Fully-Mergeable Quantile Sketch, Masson et al. (2019) [source code]
- Summingbird: a framework for integrating batch and online MapReduce computations, Boykin et al. (2014) [source code]
- Computing Extremely Accurate Quantiles Using t-Digests, Dunning and Ertl (2019) [source code]
- SPFlow: A library for deep probabilistic learning using SPNs, Molina et al. (2019) [source code]
- Dimple: Java and Matlab libraries for probabilistic inference, Barber et al. (2016) [source code]
- Chimple: a Probabilistic Programming Java API, Hershey et al. (2016) [source code]
- CREMA: A Java Library for Credal Network Inference, Huber et al. (2020) [source code]
- CREDICI: A Java Library for Causal Inference by Credal Networks, Cabañas et al. (2020) [source code]
- JavaBayes: Bayesian networks in Java, Cozman (2001) [source code]
- Theano-PyMC, Willard (2020) [source code]
- MonteCarloMeasurements.jl: Nonlinear Propagation of Arbitrary Multivariate Distributions by means of Method Overloading, Carlson (2020) [source code]
- Part-of-Speech Tagging in Kotlin, Khalusova (2021) [source code]