From 8e0059bfdd6284cc9389bf66bc7cfc9f14ce987a Mon Sep 17 00:00:00 2001 From: thomassargent30 Date: Wed, 16 Feb 2022 16:54:23 -0700 Subject: [PATCH] Tom's Feb 16 edits of two lectures --- lectures/likelihood_ratio_process.md | 4 +- lectures/multivariate_normal.md | 1175 +++++++++++++------------- 2 files changed, 609 insertions(+), 570 deletions(-) diff --git a/lectures/likelihood_ratio_process.md b/lectures/likelihood_ratio_process.md index 92cd40c26..b31555b48 100644 --- a/lectures/likelihood_ratio_process.md +++ b/lectures/likelihood_ratio_process.md @@ -232,7 +232,7 @@ Mathematical induction implies $E\left[L\left(w^{t}\right)\bigm|q=g\right]=1$ for all $t \geq 1$. -## Peculiar Property of Likelihood Ratio Process +## Peculiar Property How can $E\left[L\left(w^{t}\right)\bigm|q=g\right]=1$ possibly be true when most probability mass of the likelihood ratio process is piling up near $0$ as @@ -545,7 +545,7 @@ control tests during World War II. A Navy Captain who had been ordered to perform tests of this kind had doubts about it that he presented to Milton Friedman, as we describe in {doc}`this lecture `. -## Kullback–Leibler divergence +## Kullback–Leibler Divergence Now let’s consider a case in which neither $g$ nor $f$ generates the data. diff --git a/lectures/multivariate_normal.md b/lectures/multivariate_normal.md index 28f36e691..d8934dd9a 100644 --- a/lectures/multivariate_normal.md +++ b/lectures/multivariate_normal.md @@ -35,12 +35,12 @@ In this lecture, you will learn formulas for * marginal distributions for all subvectors of $x$ * conditional distributions for subvectors of $x$ conditional on other subvectors of $x$ -We will use the multivariate normal distribution to formulate some classic models: +We will use the multivariate normal distribution to formulate some useful models: -* a **factor analytic model** of an intelligence quotient, i.e., IQ -* a **factor analytic model** of two independent inherent abilities, mathematical and verbal. +* a factor analytic model of an intelligence quotient, i.e., IQ +* a factor analytic model of two independent inherent abilities, say, mathematical and verbal. * a more general factor analytic model -* PCA as an approximation to a factor analytic model +* Principal Components Analysis (PCA) as an approximation to a factor analytic model * time series generated by linear stochastic difference equations * optimal linear filtering theory @@ -56,9 +56,9 @@ For a multivariate normal distribution it is very convenient that - conditional distributions are characterized by multivariate linear regressions -We apply our Python class to some classic examples. +We apply our Python class to some examples. -We will use the following imports: +We use the following imports: ```{code-cell} ipython %matplotlib inline @@ -82,6 +82,8 @@ where $\mu=Ez$ is the mean of the random vector $z$ and $\Sigma=E\left(z-\mu\right)\left(z-\mu\right)^\prime$ is the covariance matrix of $z$. +The covariance matrix $\Sigma$ is symmetric and positive definite. 
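A quick way to check this requirement in code is to attempt a Cholesky factorization, which succeeds only for symmetric positive definite matrices. Here is a small sketch (it assumes `numpy` is imported as `np`, as in the import cell above; `Σ_example` is purely illustrative and is not used later):

```{code-cell} python3
# an illustrative candidate covariance matrix
Σ_example = np.array([[1., .5],
                      [.5, 2.]])

# symmetry check
print(np.allclose(Σ_example, Σ_example.T))

# np.linalg.cholesky raises LinAlgError unless its argument is positive definite
np.linalg.cholesky(Σ_example)
```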
+ ```{code-cell} python3 @njit def f(z, μ, Σ): @@ -110,10 +112,14 @@ def f(z, μ, Σ): return (2 * np.pi) ** (-N/2) * temp1 * temp2 ``` -For some integer $k\in \{2,\dots, N-1\}$, partition +For some integer $k\in \{1,\dots, N-1\}$, partition $z$ as -$z=\left[\begin{array}{c} z_{1}\\ z_{2} \end{array}\right]$, where -$z_1$ is an $\left(N-k\right)\times1$ vector and $z_2$ + +$$ +z=\left[\begin{array}{c} z_{1}\\ z_{2} \end{array}\right], +$$ + + where $z_1$ is an $\left(N-k\right)\times1$ vector and $z_2$ is a $k\times1$ vector. Let @@ -161,7 +167,7 @@ $$ $$ is an $\left(N-k\right) \times k$ matrix of **population -regression coefficients** of $z_1 - \mu_1$ on $z_2 - \mu_2$. +regression coefficients** of the $(N -k) \times 1$ random vector $z_1 - \mu_1$ on the $k \times 1$ random vector $z_2 - \mu_2$. The following class constructs a multivariate normal distribution instance with two methods. @@ -251,7 +257,7 @@ trivariate example. We’ll compute population moments of some conditional distributions using our `MultivariateNormal` class. -Then for fun we’ll compute sample analogs of the associated population +For fun we’ll also compute sample analogs of the associated population regressions by generating simulations and then computing linear least squares regressions. @@ -267,14 +273,14 @@ $$ 0\\ 0 \end{array}\right],\quad\Sigma=\left[\begin{array}{cc} -1 & .2\\ -.2 & 1 +1 & .5\\ +.5 & 2 \end{array}\right] $$ ```{code-cell} python3 μ = np.array([0., 0.]) -Σ = np.array([[1., .2], [.2 ,1.]]) +Σ = np.array([[1., .5], [.5 ,2.]]) # construction of the multivariate normal instance multi_normal = MultivariateNormal(μ, Σ) @@ -288,7 +294,25 @@ multi_normal.partition(k) multi_normal.βs[0] ``` -Let’s compute the mean and variance of the distribution of $z_1$ + +To illustrate the idea that you _can regress anything on anything else_, let's first compute the mean and variance of the distribution of $z_2$ +conditional on $z_1=5$. + +After that we'll reverse what are on the left and right sides of the regression. + + + +```{code-cell} python3 +# compute the cond. dist. of z1 +ind = 1 +z1 = np.array([5.]) # given z1 + +μ2_hat, Σ2_hat = multi_normal.cond_dist(ind, z1) +print('μ2_hat, Σ2_hat = ', μ2_hat, Σ2_hat) +``` + + +Now let’s compute the mean and variance of the distribution of $z_1$ conditional on $z_2=5$. ```{code-cell} python3 @@ -357,8 +381,8 @@ compare it with $\hat{\mu}_1$ Thus, in each case, for our very large sample size, the sample analogues closely approximate their population counterparts. -These close approximations are foretold by a version of a Law of Large -Numbers. +A Law of Large +Numbers explains why sample analogues approximate population objects. ## Trivariate Example @@ -444,10 +468,10 @@ $$ \theta = \mu_{\theta} + \sigma_{\theta} w_{n+1}. $$ -We assume the noise in the test scores is IID and not correlated with +We assume that the noises $\{w_i\}_{i=1}^N$ in the test scores are IID and not correlated with IQ. -In particular, we assume $\{w_i\}_{i=1}^{n+1}$ are i.i.d. standard +We also assume that $\{w_i\}_{i=1}^{n+1}$ are i.i.d. standard normal: $$ @@ -461,7 +485,7 @@ w_{n+1} \end{array}\right]\sim N\left(0,I_{n+1}\right) $$ -The following system describes the random vector $X$ that +The following system describes the $(n+1) \times 1$ random vector $X$ that interests us: $$ @@ -531,7 +555,7 @@ Assume we have recorded $50$ test scores and we know that $\mu_{\theta}=100$, $\sigma_{\theta}=10$, and $\sigma_{y}=10$. 
-We can compute the mean vector and covariance matrix of $x$ easily +We can compute the mean vector and covariance matrix of $X$ easily with our `construct_moments_IQ` function as follows. ```{code-cell} python3 @@ -546,6 +570,8 @@ We can now use our `MultivariateNormal` class to construct an instance, then partition the mean vector and covariance matrix as we wish. +We want to regress IQ, the random variable $\theta$ (_what we don't know_), on the vector $y$ of test scores (_what we do know_). + We choose `k=n` so that $z_{1} = y$ and $z_{2} = \theta$. ```{code-cell} python3 @@ -572,11 +598,13 @@ y = x[:-1] # test scores θ ``` -The method `cond_dist` takes test scores as input and returns the +The method `cond_dist` takes test scores $y$ as input and returns the conditional normal distribution of the IQ $\theta$. -Note that now $\theta$ is what we denoted as $z_{2}$ in the -general case so we need to set `ind=1`. +In the following code, `ind` sets the variables on the right side of the regression. + +Given the way we have defined the vector $X$, we want to set `ind=1` in order to make $\theta$ the left side variable in the +population regression. ```{code-cell} python3 ind = 1 @@ -586,7 +614,7 @@ multi_normal_IQ.cond_dist(ind, y) The first number is the conditional mean $\hat{\mu}_{\theta}$ and the second is the conditional variance $\hat{\Sigma}_{\theta}$. -How do the additional test scores affect our inferences? +How do additional test scores affect our inferences? To shed light on this, we compute a sequence of conditional distributions of $\theta$ by varying the number of test scores in @@ -659,11 +687,11 @@ approach $\theta$. Thus, each $y_{i}$ adds information about $\theta$. -If we drove the number of tests $n \rightarrow + \infty$, the +If we were to drive the number of tests $n \rightarrow + \infty$, the conditional standard deviation $\hat{\sigma}_{\theta}$ would -converge to $0$ at the rate $\frac{1}{n^{.5}}$. +converge to $0$ at rate $\frac{1}{n^{.5}}$. -## Another representation +## Information as Surprise By using a different representation, let’s look at things from a different perspective. @@ -693,7 +721,9 @@ $$ \epsilon \sim N(0, I) . $$ -Let $G=C^{-1}$; $G$ is also lower triangular. +Let $G=C^{-1}$ + +$G$ is also lower triangular. We can compute $\epsilon$ from the formula @@ -776,13 +806,13 @@ np.max(np.abs(μθ_hat_arr - μθ_hat_arr_C)) < 1e-10 np.max(np.abs(Σθ_hat_arr - Σθ_hat_arr_C)) < 1e-10 ``` -## Magic of the Cholesky factorization +## Cholesky Factor Magic -Evidently, the Cholesky factorization is automatically computing the +Evidently, the Cholesky factorizations automatically computes the population **regression coefficients** and associated statistics that are produced by our `MultivariateNormal` class. -The Cholesky factorization is computing things **recursively**. +The Cholesky factorization computes these things **recursively**. Indeed, in formula {eq}`mnv_1`, @@ -792,7 +822,7 @@ Indeed, in formula {eq}`mnv_1`, - the coefficient $c_i$ is the simple population regression coefficient of $\theta - \mu_\theta$ on $\epsilon_i$ -## Math and Verbal Components of Intelligence +## Math and Verbal Intelligence We can alter the preceding example to be more realistic. @@ -1261,801 +1291,810 @@ y This example is an instance of what is known as a **Wold representation** in time series analysis. 
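To round out this discussion, here is a small sketch of the orthogonalization that underlies such a representation (the matrix `Σ_demo` below is purely illustrative, and `np` is assumed to be imported as above): the lower triangular Cholesky factor of a covariance matrix maps i.i.d. standard normal shocks into a random vector with that covariance, and its inverse recovers the shocks.

```{code-cell} python3
# an illustrative covariance matrix
Σ_demo = np.array([[2., .8, .3],
                   [.8, 1.5, .6],
                   [.3, .6, 1.]])

C_demo = np.linalg.cholesky(Σ_demo)   # lower triangular, Σ_demo = C C'
G_demo = np.linalg.inv(C_demo)        # maps x - μ back into orthogonal shocks

print(np.allclose(C_demo @ C_demo.T, Σ_demo))

# simulate and verify that ε = G (x - μ) has (approximately) identity covariance
μ_demo = np.zeros(3)
x_sim = np.random.multivariate_normal(μ_demo, Σ_demo, size=200_000)
ε_sim = (x_sim - μ_demo) @ G_demo.T
print(np.round(np.cov(ε_sim.T), 2))
```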
-## Classic Factor Analysis Model -The factor analysis model widely used in psychology and other fields can -be represented as +## Stochastic Difference Equation + +Consider the stochastic second-order linear difference equation $$ -Y = \Lambda f + U +y_{t} = \alpha_{0} + \alpha_{1} y_{y-1} + \alpha_{2} y_{t-2} + u_{t} $$ -where - -1. $Y$ is $n \times 1$ random vector, - $E U U^{\prime} = D$ is a diagonal matrix, -1. $\Lambda$ is $n \times k$ coefficient matrix, -1. $f$ is $k \times 1$ random vector, - $E f f^{\prime} = I$, -1. $U$ is $n \times 1$ random vector, and $U \perp f$. -1. It is presumed that $k$ is small relative to $n$; often - $k$ is only $1$ or $2$, as in our IQ examples. - -This implies that +where $u_{t} \sim N \left(0, \sigma_{u}^{2}\right)$ and $$ -\begin{aligned} -\Sigma_y = E Y Y^{\prime} = \Lambda \Lambda^{\prime} + D \\ -E Y f^{\prime} = \Lambda \\ -E f Y^{\prime} = \Lambda^{\prime} -\end{aligned} +\left[\begin{array}{c} +y_{-1}\\ +y_{0} +\end{array}\right]\sim N\left(\mu_{\tilde{y}},\Sigma_{\tilde{y}}\right) $$ -Thus, the covariance matrix $\Sigma_Y$ is the sum of a diagonal -matrix $D$ and a positive semi-definite matrix -$\Lambda \Lambda^{\prime}$ of rank $k$. +It can be written as a stacked system -This means that all covariances among the $n$ components of the -$Y$ vector are intermediated by their common dependencies on the -$k<$ factors. +$$ +\underset{\equiv A}{\underbrace{\left[\begin{array}{cccccccc} +1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0\\ +-\alpha_{1} & 1 & 0 & 0 & \cdots & 0 & 0 & 0\\ +-\alpha_{2} & -\alpha_{1} & 1 & 0 & \cdots & 0 & 0 & 0\\ +0 & -\alpha_{2} & -\alpha_{1} & 1 & \cdots & 0 & 0 & 0\\ +\vdots & \vdots & \vdots & \vdots & \cdots & \vdots & \vdots & \vdots\\ +0 & 0 & 0 & 0 & \cdots & -\alpha_{2} & -\alpha_{1} & 1 +\end{array}\right]}}\left[\begin{array}{c} +y_{1}\\ +y_{2}\\ +y_{3}\\ +y_{4}\\ +\vdots\\ +y_{T} +\end{array}\right]=\underset{\equiv b}{\underbrace{\left[\begin{array}{c} +\alpha_{0}+\alpha_{1}y_{0}+\alpha_{2}y_{-1}\\ +\alpha_{0}+\alpha_{2}y_{0}\\ +\alpha_{0}\\ +\alpha_{0}\\ +\vdots\\ +\alpha_{0} +\end{array}\right]}} +$$ -Form +We can compute $y$ by solving the system $$ -Z=\left(\begin{array}{c} -f\\ -Y -\end{array}\right) +y = A^{-1} \left(b + u\right) $$ -the covariance matrix of the expanded random vector $Z$ can be -computed as +We have $$ -\Sigma_{z} = EZZ^{\prime}=\left(\begin{array}{cc} -I & \Lambda^{\prime}\\ -\Lambda & \Lambda\Lambda^{\prime}+D -\end{array}\right) +\begin{aligned} +\mu_{y} = A^{-1} \mu_{b} \\ +\Sigma_{y} &= A^{-1} E \left[\left(b - \mu_{b} + u \right) \left(b - \mu_{b} + u \right)^{\prime}\right] \left(A^{-1}\right)^{\prime} \\ + &= A^{-1} \left(\Sigma_{b} + \Sigma_{u} \right) \left(A^{-1}\right)^{\prime} +\end{aligned} $$ -In the following, we first construct the mean vector and the covariance -matrix for the case where $N=10$ and $k=2$. 
+where -```{code-cell} python3 -N = 10 -k = 2 -``` +$$ +\mu_{b}=\left[\begin{array}{c} +\alpha_{0}+\alpha_{1}\mu_{y_{0}}+\alpha_{2}\mu_{y_{-1}}\\ +\alpha_{0}+\alpha_{2}\mu_{y_{0}}\\ +\alpha_{0}\\ +\vdots\\ +\alpha_{0} +\end{array}\right] +$$ -We set the coefficient matrix $\Lambda$ and the covariance matrix -of $U$ to be +$$ +\Sigma_{b}=\left[\begin{array}{cc} +C\Sigma_{\tilde{y}}C^{\prime} & \boldsymbol{0}_{N-2\times N-2}\\ +\boldsymbol{0}_{N-2\times2} & \boldsymbol{0}_{N-2\times N-2} +\end{array}\right],\quad C=\left[\begin{array}{cc} +\alpha_{2} & \alpha_{1}\\ +0 & \alpha_{2} +\end{array}\right] +$$ $$ -\Lambda=\left(\begin{array}{cc} -1 & 0\\ -\vdots & \vdots\\ -1 & 0\\ -0 & 1\\ -\vdots & \vdots\\ -0 & 1 -\end{array}\right),\quad D=\left(\begin{array}{cccc} +\Sigma_{u}=\left[\begin{array}{cccc} \sigma_{u}^{2} & 0 & \cdots & 0\\ 0 & \sigma_{u}^{2} & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & \cdots & \sigma_{u}^{2} -\end{array}\right) +\end{array}\right] $$ -where the first half of the first column of $\Lambda$ is filled -with $1$s and $0$s for the rest half, and symmetrically -for the second column. $D$ is a diagonal matrix with parameter -$\sigma_{u}^{2}$ on the diagonal. - ```{code-cell} python3 -Λ = np.zeros((N, k)) -Λ[:N//2, 0] = 1 -Λ[N//2:, 1] = 1 +# set parameters +T = 80 +T = 160 +# coefficients of the second order difference equation +𝛼0 = 10 +𝛼1 = 1.53 +𝛼2 = -.9 -σu = .5 -D = np.eye(N) * σu ** 2 +# variance of u +σu = 1. +σu = 10. + +# distribution of y_{-1} and y_{0} +μy_tilde = np.array([1., 0.5]) +Σy_tilde = np.array([[2., 1.], [1., 0.5]]) ``` ```{code-cell} python3 -# compute Σy -Σy = Λ @ Λ.T + D -``` +# construct A and A^{\prime} +A = np.zeros((T, T)) -We can now construct the mean vector and the covariance matrix for -$Z$. +for i in range(T): + A[i, i] = 1 -```{code-cell} python3 -μz = np.zeros(k+N) + if i-1 >= 0: + A[i, i-1] = -𝛼1 -Σz = np.empty((k+N, k+N)) + if i-2 >= 0: + A[i, i-2] = -𝛼2 -Σz[:k, :k] = np.eye(k) -Σz[:k, k:] = Λ.T -Σz[k:, :k] = Λ -Σz[k:, k:] = Σy +A_inv = np.linalg.inv(A) ``` ```{code-cell} python3 -z = np.random.multivariate_normal(μz, Σz) +# compute the mean vectors of b and y +μb = np.full(T, 𝛼0) +μb[0] += 𝛼1 * μy_tilde[1] + 𝛼2 * μy_tilde[0] +μb[1] += 𝛼2 * μy_tilde[1] -f = z[:k] -y = z[k:] +μy = A_inv @ μb ``` ```{code-cell} python3 -multi_normal_factor = MultivariateNormal(μz, Σz) -multi_normal_factor.partition(k) -``` +# compute the covariance matrices of b and y +Σu = np.eye(T) * σu ** 2 -Let’s compute the conditional distribution of the hidden factor -$f$ on the observations $Y$, namely, $f \mid Y=y$. +Σb = np.zeros((T, T)) -```{code-cell} python3 -multi_normal_factor.cond_dist(0, y) +C = np.array([[𝛼2, 𝛼1], [0, 𝛼2]]) +Σb[:2, :2] = C @ Σy_tilde @ C.T + +Σy = A_inv @ (Σb + Σu) @ A_inv.T ``` -We can verify that the conditional mean -$E \left[f \mid Y=y\right] = B Y$ where -$B = \Lambda^{\prime} \Sigma_{y}^{-1}$. +## Application to Stock Price Model -```{code-cell} python3 -B = Λ.T @ np.linalg.inv(Σy) +Let -B @ y -``` +$$ +p_{t} = \sum_{j=0}^{T-t} \beta^{j} y_{t+j} +$$ -Similarly, we can compute the conditional distribution $Y \mid f$. 
+Form -```{code-cell} python3 -multi_normal_factor.cond_dist(1, f) -``` +$$ +\underset{\equiv p}{\underbrace{\left[\begin{array}{c} +p_{1}\\ +p_{2}\\ +p_{3}\\ +\vdots\\ +p_{T} +\end{array}\right]}}=\underset{\equiv B}{\underbrace{\left[\begin{array}{ccccc} +1 & \beta & \beta^{2} & \cdots & \beta^{T-1}\\ +0 & 1 & \beta & \cdots & \beta^{T-2}\\ +0 & 0 & 1 & \cdots & \beta^{T-3}\\ +\vdots & \vdots & \vdots & \vdots & \vdots\\ +0 & 0 & 0 & \cdots & 1 +\end{array}\right]}}\left[\begin{array}{c} +y_{1}\\ +y_{2}\\ +y_{3}\\ +\vdots\\ +y_{T} +\end{array}\right] +$$ -It can be verified that the mean is -$\Lambda I^{-1} f = \Lambda f$. +we have + +$$ +\begin{aligned} +\mu_{p} = B \mu_{y} \\ +\Sigma_{p} = B \Sigma_{y} B^{\prime} +\end{aligned} +$$ ```{code-cell} python3 -Λ @ f +β = .96 ``` -## PCA as Approximation to Factor Analytic Model +```{code-cell} python3 +# construct B +B = np.zeros((T, T)) -For fun, let’s apply a Principal Components Analysis (PCA) decomposition -to a covariance matrix $\Sigma_y$ that in fact is governed by our factor-analytic -model. - -Technically, this means that the PCA model is misspecified. (Can you -explain why?) - -Nevertheless, this exercise will let us study how well the first two -principal components from a PCA can approximate the conditional -expectations $E f_i | Y$ for our two factors $f_i$, -$i=1,2$ for the factor analytic model that we have assumed truly -governs the data on $Y$ we have generated. +for i in range(T): + B[i, i:] = β ** np.arange(0, T-i) +``` -So we compute the PCA decomposition +Denote $$ -\Sigma_{y} = P \tilde{\Lambda} P^{\prime} +z=\left[\begin{array}{c} +y\\ +p +\end{array}\right]=\underset{\equiv D}{\underbrace{\left[\begin{array}{c} +I\\ +B +\end{array}\right]}} y $$ -where $\tilde{\Lambda}$ is a diagonal matrix. - -We have +Thus, $\{y_t\}_{t=1}^{T}$ and $\{p_t\}_{t=1}^{T}$ jointly +follow the multivariate normal distribution +$N \left(\mu_{z}, \Sigma_{z}\right)$, where $$ -Y = P \epsilon +\mu_{z}=D\mu_{y} $$ -and - $$ -\epsilon = P^\prime Y +\Sigma_{z}=D\Sigma_{y}D^{\prime} $$ -Note that we will arrange the eigenvectors in $P$ in the -*descending* order of eigenvalues. - ```{code-cell} python3 -𝜆_tilde, P = np.linalg.eigh(Σy) - -# arrange the eigenvectors by eigenvalues -ind = sorted(range(N), key=lambda x: 𝜆_tilde[x], reverse=True) - -P = P[:, ind] -𝜆_tilde = 𝜆_tilde[ind] -Λ_tilde = np.diag(𝜆_tilde) - -print('𝜆_tilde =', 𝜆_tilde) +D = np.vstack([np.eye(T), B]) ``` ```{code-cell} python3 -# verify the orthogonality of eigenvectors -np.abs(P @ P.T - np.eye(N)).max() +μz = D @ μy +Σz = D @ Σy @ D.T ``` -```{code-cell} python3 -# verify the eigenvalue decomposition is correct -P @ Λ_tilde @ P.T -``` +We can simulate paths of $y_{t}$ and $p_{t}$ and compute the +conditional mean $E \left[p_{t} \mid y_{t-1}, y_{t}\right]$ using +the `MultivariateNormal` class. 
```{code-cell} python3 -ε = P.T @ y - -print("ε = ", ε) +z = np.random.multivariate_normal(μz, Σz) +y, p = z[:T], z[T:] ``` ```{code-cell} python3 -# print the values of the two factors - -print('f = ', f) -``` - -Below we’ll plot several things +cond_Ep = np.empty(T-1) -- the $N$ values of $y$ -- the $N$ values of the principal components $\epsilon$ -- the value of the first factor $f_1$ plotted only for the first - $N/2$ observations of $y$ for which it receives a - non-zero loading in $\Lambda$ -- the value of the second factor $f_2$ plotted only for the final - $N/2$ observations for which it receives a non-zero loading in - $\Lambda$ +sub_μ = np.empty(3) +sub_Σ = np.empty((3, 3)) +for t in range(2, T+1): + sub_μ[:] = μz[[t-2, t-1, T-1+t]] + sub_Σ[:, :] = Σz[[t-2, t-1, T-1+t], :][:, [t-2, t-1, T-1+t]] -```{code-cell} python3 -plt.scatter(range(N), y, label='y') -plt.scatter(range(N), ε, label='$\epsilon$') -plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$') -plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$') -plt.legend() + multi_normal = MultivariateNormal(sub_μ, sub_Σ) + multi_normal.partition(2) -plt.show() + cond_Ep[t-2] = multi_normal.cond_dist(1, y[t-2:t])[0][0] ``` -Consequently, the first two $\epsilon_{j}$ correspond to the -largest two eigenvalues. - -Let’s look at them, after which we’ll look at $E f | y = B y$ - ```{code-cell} python3 -ε[:2] -``` +plt.plot(range(1, T), y[1:], label='$y_{t}$') +plt.plot(range(1, T), y[:-1], label='$y_{t-1}$') +plt.plot(range(1, T), p[1:], label='$p_{t}$') +plt.plot(range(1, T), cond_Ep, label='$Ep_{t}|y_{t}, y_{t-1}$') -```{code-cell} python3 -# compare with Ef|y -B @ y +plt.xlabel('t') +plt.legend(loc=1) +plt.show() ``` -The fraction of variance in $y_{t}$ explained by the first two -principal components can be computed as below. +In the above graph, the green line is what the price of the stock would +be if people had perfect foresight about the path of dividends while the +green line is the conditional expectation $E p_t | y_t, y_{t-1}$, which is what the price would +be if people did not have perfect foresight but were optimally +predicting future dividends on the basis of the information +$y_t, y_{t-1}$ at time $t$. -```{code-cell} python3 -𝜆_tilde[:2].sum() / 𝜆_tilde.sum() -``` +## Filtering Foundations -Compute +Assume that $x_0$ is an $n \times 1$ random vector and that +$y_0$ is a $p \times 1$ random vector determined by the +*observation equation* $$ -\hat{Y} = P_{j} \epsilon_{j} + P_{k} \epsilon_{k} +y_0 = G x_0 + v_0 , \quad x_0 \sim {\mathcal N}(\hat x_0, \Sigma_0), \quad v_0 \sim {\mathcal N}(0, R) $$ -where $P_{j}$ and $P_{k}$ correspond to the largest two -eigenvalues. - -```{code-cell} python3 -y_hat = P[:, :2] @ ε[:2] -``` - -In this example, it turns out that the projection $\hat{Y}$ of -$Y$ on the first two principal components does a good job of -approximating $Ef \mid y$. - -We confirm this in the following plot of $f$, -$E y \mid f$, $E f \mid y$, and $\hat{y}$ on the -coordinate axis versus $y$ on the ordinate axis. 
- -```{code-cell} python3 -plt.scatter(range(N), Λ @ f, label='$Ey|f$') -plt.scatter(range(N), y_hat, label='$\hat{y}$') -plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$') -plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$') - -Efy = B @ y -plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label='$Ef_{1}|y$') -plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label='$Ef_{2}|y$') -plt.legend() - -plt.show() -``` +where $v_0$ is orthogonal to $x_0$, $G$ is a +$p \times n$ matrix, and $R$ is a $p \times p$ +positive definite matrix. -The covariance matrix of $\hat{Y}$ can be computed by first -constructing the covariance matrix of $\epsilon$ and then use the -upper left block for $\epsilon_{1}$ and $\epsilon_{2}$. +We consider the problem of someone who -```{code-cell} python3 -Σεjk = (P.T @ Σy @ P)[:2, :2] +* *observes* $y_0$ +* does not observe $x_0$, +* knows $\hat x_0, \Sigma_0, G, R$ and therefore the joint probability distribution of the vector $\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$ +* wants to infer $x_0$ from $y_0$ in light of what he knows about that +joint probability distribution. -Pjk = P[:, :2] +Therefore, the person wants to construct the probability distribution of +$x_0$ conditional on the random vector $y_0$. -Σy_hat = Pjk @ Σεjk @ Pjk.T -print('Σy_hat = \n', Σy_hat) -``` +The joint distribution of +$\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$ is multivariate normal +${\mathcal N}(\mu, \Sigma)$ with -## Stochastic Difference Equation +$$ +\mu = \begin{bmatrix} \hat x_0 \cr G \hat x_0 \end{bmatrix} , \quad + \Sigma = \begin{bmatrix} \Sigma_0 & \Sigma_0 G' \cr + G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix} +$$ -Consider the stochastic second-order linear difference equation +By applying an appropriate instance of the above formulas for the mean vector $\hat \mu_1$ and covariance matrix +$\hat \Sigma_{11}$ of $z_1$ conditional on $z_2$, we find that the probability distribution of +$x_0$ conditional on $y_0$ is +${\mathcal N}(\tilde x_0, \tilde \Sigma_0)$ where $$ -y_{t} = \alpha_{0} + \alpha_{1} y_{y-1} + \alpha_{2} y_{t-2} + u_{t} +\begin{aligned} \beta_0 & = \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} \cr +\tilde x_0 & = \hat x_0 + \beta_0 ( y_0 - G \hat x_0) \cr + \tilde \Sigma_0 & = \Sigma_0 - \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0 + \end{aligned} $$ -where $u_{t} \sim N \left(0, \sigma_{u}^{2}\right)$ and +### Step toward dynamics + +Now suppose that we are in a time series setting and that we have the +one-step state transition equation $$ -\left[\begin{array}{c} -y_{-1}\\ -y_{0} -\end{array}\right]\sim N\left(\mu_{\tilde{y}},\Sigma_{\tilde{y}}\right) +x_1 = A x_0 + C w_1 , \quad w_1 \sim {\mathcal N}(0, I ) $$ -It can be written as a stacked system +where $A$ is an $n \times n$ matrix and $C$ is an +$n \times m$ matrix. 
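Because, conditional on $y_0$, we have $x_0 \sim {\mathcal N}(\tilde x_0, \tilde \Sigma_0)$, while $w_1$ is independent of both $x_0$ and $y_0$, the transition equation delivers the conditional mean and conditional covariance

$$
E \left[ x_1 \mid y_0 \right] = A \tilde x_0 , \quad {\rm Var} \left( x_1 \mid y_0 \right) = A \tilde \Sigma_0 A' + C C'
$$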
+ +It follows that the probability distribution of $x_1$ conditional +on $y_0$ is $$ -\underset{\equiv A}{\underbrace{\left[\begin{array}{cccccccc} -1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0\\ --\alpha_{1} & 1 & 0 & 0 & \cdots & 0 & 0 & 0\\ --\alpha_{2} & -\alpha_{1} & 1 & 0 & \cdots & 0 & 0 & 0\\ -0 & -\alpha_{2} & -\alpha_{1} & 1 & \cdots & 0 & 0 & 0\\ -\vdots & \vdots & \vdots & \vdots & \cdots & \vdots & \vdots & \vdots\\ -0 & 0 & 0 & 0 & \cdots & -\alpha_{2} & -\alpha_{1} & 1 -\end{array}\right]}}\left[\begin{array}{c} -y_{1}\\ -y_{2}\\ -y_{3}\\ -y_{4}\\ -\vdots\\ -y_{T} -\end{array}\right]=\underset{\equiv b}{\underbrace{\left[\begin{array}{c} -\alpha_{0}+\alpha_{1}y_{0}+\alpha_{2}y_{-1}\\ -\alpha_{0}+\alpha_{2}y_{0}\\ -\alpha_{0}\\ -\alpha_{0}\\ -\vdots\\ -\alpha_{0} -\end{array}\right]}} +x_1 | y_0 \sim {\mathcal N}(A \tilde x_0 , A \tilde \Sigma_0 A' + C C' ) $$ -We can compute $y$ by solving the system +Define $$ -y = A^{-1} \left(b + u\right) +\begin{aligned} \hat x_1 & = A \tilde x_0 \cr + \Sigma_1 & = A \tilde \Sigma_0 A' + C C' +\end{aligned} $$ -We have +### Dynamic version + +Suppose now that for $t \geq 0$, +$\{x_{t+1}, y_t\}_{t=0}^\infty$ are governed by the equations $$ \begin{aligned} -\mu_{y} = A^{-1} \mu_{b} \\ -\Sigma_{y} &= A^{-1} E \left[\left(b - \mu_{b} + u \right) \left(b - \mu_{b} + u \right)^{\prime}\right] \left(A^{-1}\right)^{\prime} \\ - &= A^{-1} \left(\Sigma_{b} + \Sigma_{u} \right) \left(A^{-1}\right)^{\prime} +x_{t+1} & = A x_t + C w_{t+1} \cr +y_t & = G x_t + v_t \end{aligned} $$ -where +where as before $x_0 \sim {\mathcal N}(\hat x_0, \Sigma_0)$, +$w_{t+1}$ is the $t+1$th component of an i.i.d. stochastic +process distributed as $w_{t+1} \sim {\mathcal N}(0, I)$, and +$v_t$ is the $t$th component of an i.i.d. process +distributed as $v_t \sim {\mathcal N}(0, R)$ and the +$\{w_{t+1}\}_{t=0}^\infty$ and $\{v_t\}_{t=0}^\infty$ +processes are orthogonal at all pairs of dates. -$$ -\mu_{b}=\left[\begin{array}{c} -\alpha_{0}+\alpha_{1}\mu_{y_{0}}+\alpha_{2}\mu_{y_{-1}}\\ -\alpha_{0}+\alpha_{2}\mu_{y_{0}}\\ -\alpha_{0}\\ -\vdots\\ -\alpha_{0} -\end{array}\right] -$$ +The logic and +formulas that we applied above imply that the probability distribution +of $x_t$ conditional on +$y_0, y_1, \ldots , y_{t-1} = y^{t-1}$ is $$ -\Sigma_{b}=\left[\begin{array}{cc} -C\Sigma_{\tilde{y}}C^{\prime} & \boldsymbol{0}_{N-2\times N-2}\\ -\boldsymbol{0}_{N-2\times2} & \boldsymbol{0}_{N-2\times N-2} -\end{array}\right],\quad C=\left[\begin{array}{cc} -\alpha_{2} & \alpha_{1}\\ -0 & \alpha_{2} -\end{array}\right] +x_t | y^{t-1} \sim {\mathcal N}(A \tilde x_t , A \tilde \Sigma_t A' + C C' ) $$ +where $\{\tilde x_t, \tilde \Sigma_t\}_{t=1}^\infty$ can be +computed by iterating on the following equations starting from +$t=1$ and initial conditions for +$\tilde x_0, \tilde \Sigma_0$ computed as we have above: + $$ -\Sigma_{u}=\left[\begin{array}{cccc} -\sigma_{u}^{2} & 0 & \cdots & 0\\ -0 & \sigma_{u}^{2} & \cdots & 0\\ -\vdots & \vdots & \vdots & \vdots\\ -0 & 0 & \cdots & \sigma_{u}^{2} -\end{array}\right] +\begin{aligned} \Sigma_t & = A \tilde \Sigma_{t-1} A' + C C' \cr + \hat x_t & = A \tilde x_{t-1} \cr +\beta_t & = \Sigma_t G' (G \Sigma_t G' + R)^{-1} \cr +\tilde x_t & = \hat x_t + \beta_t ( y_t - G \hat x_t) \cr + \tilde \Sigma_t & = \Sigma_t - \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t + \end{aligned} $$ +We can use the Python class *MultivariateNormal* to construct examples. 
+ +Here is an example for a single period problem at time $0$ + ```{code-cell} python3 -# set parameters -T = 80 -T = 160 -# coefficients of the second order difference equation -𝛼0 = 10 -𝛼1 = 1.53 -𝛼2 = -.9 +G = np.array([[1., 3.]]) +R = np.array([[1.]]) -# variance of u -σu = 1. -σu = 10. +x0_hat = np.array([0., 1.]) +Σ0 = np.array([[1., .5], [.3, 2.]]) -# distribution of y_{-1} and y_{0} -μy_tilde = np.array([1., 0.5]) -Σy_tilde = np.array([[2., 1.], [1., 0.5]]) +μ = np.hstack([x0_hat, G @ x0_hat]) +Σ = np.block([[Σ0, Σ0 @ G.T], [G @ Σ0, G @ Σ0 @ G.T + R]]) ``` ```{code-cell} python3 -# construct A and A^{\prime} -A = np.zeros((T, T)) - -for i in range(T): - A[i, i] = 1 +# construction of the multivariate normal instance +multi_normal = MultivariateNormal(μ, Σ) +``` - if i-1 >= 0: - A[i, i-1] = -𝛼1 +```{code-cell} python3 +multi_normal.partition(2) +``` - if i-2 >= 0: - A[i, i-2] = -𝛼2 +```{code-cell} python3 +# the observation of y +y0 = 2.3 -A_inv = np.linalg.inv(A) +# conditional distribution of x0 +μ1_hat, Σ11 = multi_normal.cond_dist(0, y0) +μ1_hat, Σ11 ``` ```{code-cell} python3 -# compute the mean vectors of b and y -μb = np.full(T, 𝛼0) -μb[0] += 𝛼1 * μy_tilde[1] + 𝛼2 * μy_tilde[0] -μb[1] += 𝛼2 * μy_tilde[1] +A = np.array([[0.5, 0.2], [-0.1, 0.3]]) +C = np.array([[2.], [1.]]) -μy = A_inv @ μb +# conditional distribution of x1 +x1_cond = A @ μ1_hat +Σ1_cond = C @ C.T + A @ Σ11 @ A.T +x1_cond, Σ1_cond ``` +### Code for Iterating + +Here is code for solving a dynamic filtering problem by iterating on our +equations, followed by an example. + ```{code-cell} python3 -# compute the covariance matrices of b and y -Σu = np.eye(T) * σu ** 2 +def iterate(x0_hat, Σ0, A, C, G, R, y_seq): -Σb = np.zeros((T, T)) + p, n = G.shape -C = np.array([[𝛼2, 𝛼1], [0, 𝛼2]]) -Σb[:2, :2] = C @ Σy_tilde @ C.T + T = len(y_seq) + x_hat_seq = np.empty((T+1, n)) + Σ_hat_seq = np.empty((T+1, n, n)) -Σy = A_inv @ (Σb + Σu) @ A_inv.T + x_hat_seq[0] = x0_hat + Σ_hat_seq[0] = Σ0 + + for t in range(T): + xt_hat = x_hat_seq[t] + Σt = Σ_hat_seq[t] + μ = np.hstack([xt_hat, G @ xt_hat]) + Σ = np.block([[Σt, Σt @ G.T], [G @ Σt, G @ Σt @ G.T + R]]) + + # filtering + multi_normal = MultivariateNormal(μ, Σ) + multi_normal.partition(n) + x_tilde, Σ_tilde = multi_normal.cond_dist(0, y_seq[t]) + + # forecasting + x_hat_seq[t+1] = A @ x_tilde + Σ_hat_seq[t+1] = C @ C.T + A @ Σ_tilde @ A.T + + return x_hat_seq, Σ_hat_seq ``` -## Application to Stock Price Model +```{code-cell} python3 +iterate(x0_hat, Σ0, A, C, G, R, [2.3, 1.2, 3.2]) +``` -Let +The iterative algorithm just described is a version of the celebrated **Kalman filter**. -$$ -p_{t} = \sum_{j=0}^{T-t} \beta^{j} y_{t+j} -$$ +We describe the Kalman filter and some applications of it in {doc}`A First Look at the Kalman Filter ` -Form + +## Classic Factor Analysis Model + +The factor analysis model widely used in psychology and other fields can +be represented as $$ -\underset{\equiv p}{\underbrace{\left[\begin{array}{c} -p_{1}\\ -p_{2}\\ -p_{3}\\ -\vdots\\ -p_{T} -\end{array}\right]}}=\underset{\equiv B}{\underbrace{\left[\begin{array}{ccccc} -1 & \beta & \beta^{2} & \cdots & \beta^{T-1}\\ -0 & 1 & \beta & \cdots & \beta^{T-2}\\ -0 & 0 & 1 & \cdots & \beta^{T-3}\\ -\vdots & \vdots & \vdots & \vdots & \vdots\\ -0 & 0 & 0 & \cdots & 1 -\end{array}\right]}}\left[\begin{array}{c} -y_{1}\\ -y_{2}\\ -y_{3}\\ -\vdots\\ -y_{T} -\end{array}\right] +Y = \Lambda f + U $$ -we have +where + +1. $Y$ is $n \times 1$ random vector, + $E U U^{\prime} = D$ is a diagonal matrix, +1. 
$\Lambda$ is $n \times k$ coefficient matrix, +1. $f$ is $k \times 1$ random vector, + $E f f^{\prime} = I$, +1. $U$ is $n \times 1$ random vector, and $U \perp f$. +1. It is presumed that $k$ is small relative to $n$; often + $k$ is only $1$ or $2$, as in our IQ examples. + +This implies that $$ \begin{aligned} -\mu_{p} = B \mu_{y} \\ -\Sigma_{p} = B \Sigma_{y} B^{\prime} +\Sigma_y = E Y Y^{\prime} = \Lambda \Lambda^{\prime} + D \\ +E Y f^{\prime} = \Lambda \\ +E f Y^{\prime} = \Lambda^{\prime} \end{aligned} $$ -```{code-cell} python3 -β = .96 -``` - -```{code-cell} python3 -# construct B -B = np.zeros((T, T)) +Thus, the covariance matrix $\Sigma_Y$ is the sum of a diagonal +matrix $D$ and a positive semi-definite matrix +$\Lambda \Lambda^{\prime}$ of rank $k$. -for i in range(T): - B[i, i:] = β ** np.arange(0, T-i) -``` +This means that all covariances among the $n$ components of the +$Y$ vector are intermediated by their common dependencies on the +$k<$ factors. -Denote +Form $$ -z=\left[\begin{array}{c} -y\\ -p -\end{array}\right]=\underset{\equiv D}{\underbrace{\left[\begin{array}{c} -I\\ -B -\end{array}\right]}} y +Z=\left(\begin{array}{c} +f\\ +Y +\end{array}\right) $$ -Thus, $\{y_t\}_{t=1}^{T}$ and $\{p_t\}_{t=1}^{T}$ jointly -follow the multivariate normal distribution -$N \left(\mu_{z}, \Sigma_{z}\right)$, where +the covariance matrix of the expanded random vector $Z$ can be +computed as $$ -\mu_{z}=D\mu_{y} +\Sigma_{z} = EZZ^{\prime}=\left(\begin{array}{cc} +I & \Lambda^{\prime}\\ +\Lambda & \Lambda\Lambda^{\prime}+D +\end{array}\right) $$ -$$ -\Sigma_{z}=D\Sigma_{y}D^{\prime} -$$ +In the following, we first construct the mean vector and the covariance +matrix for the case where $N=10$ and $k=2$. ```{code-cell} python3 -D = np.vstack([np.eye(T), B]) +N = 10 +k = 2 ``` -```{code-cell} python3 -μz = D @ μy -Σz = D @ Σy @ D.T -``` +We set the coefficient matrix $\Lambda$ and the covariance matrix +of $U$ to be -We can simulate paths of $y_{t}$ and $p_{t}$ and compute the -conditional mean $E \left[p_{t} \mid y_{t-1}, y_{t}\right]$ using -the `MultivariateNormal` class. +$$ +\Lambda=\left(\begin{array}{cc} +1 & 0\\ +\vdots & \vdots\\ +1 & 0\\ +0 & 1\\ +\vdots & \vdots\\ +0 & 1 +\end{array}\right),\quad D=\left(\begin{array}{cccc} +\sigma_{u}^{2} & 0 & \cdots & 0\\ +0 & \sigma_{u}^{2} & \cdots & 0\\ +\vdots & \vdots & \vdots & \vdots\\ +0 & 0 & \cdots & \sigma_{u}^{2} +\end{array}\right) +$$ + +where the first half of the first column of $\Lambda$ is filled +with $1$s and $0$s for the rest half, and symmetrically +for the second column. + +$D$ is a diagonal matrix with parameter +$\sigma_{u}^{2}$ on the diagonal. ```{code-cell} python3 -z = np.random.multivariate_normal(μz, Σz) -y, p = z[:T], z[T:] +Λ = np.zeros((N, k)) +Λ[:N//2, 0] = 1 +Λ[N//2:, 1] = 1 + +σu = .5 +D = np.eye(N) * σu ** 2 ``` ```{code-cell} python3 -cond_Ep = np.empty(T-1) +# compute Σy +Σy = Λ @ Λ.T + D +``` -sub_μ = np.empty(3) -sub_Σ = np.empty((3, 3)) -for t in range(2, T+1): - sub_μ[:] = μz[[t-2, t-1, T-1+t]] - sub_Σ[:, :] = Σz[[t-2, t-1, T-1+t], :][:, [t-2, t-1, T-1+t]] +We can now construct the mean vector and the covariance matrix for +$Z$. 
- multi_normal = MultivariateNormal(sub_μ, sub_Σ) - multi_normal.partition(2) +```{code-cell} python3 +μz = np.zeros(k+N) - cond_Ep[t-2] = multi_normal.cond_dist(1, y[t-2:t])[0][0] +Σz = np.empty((k+N, k+N)) + +Σz[:k, :k] = np.eye(k) +Σz[:k, k:] = Λ.T +Σz[k:, :k] = Λ +Σz[k:, k:] = Σy ``` ```{code-cell} python3 -plt.plot(range(1, T), y[1:], label='$y_{t}$') -plt.plot(range(1, T), y[:-1], label='$y_{t-1}$') -plt.plot(range(1, T), p[1:], label='$p_{t}$') -plt.plot(range(1, T), cond_Ep, label='$Ep_{t}|y_{t}, y_{t-1}$') +z = np.random.multivariate_normal(μz, Σz) -plt.xlabel('t') -plt.legend(loc=1) -plt.show() +f = z[:k] +y = z[k:] ``` -In the above graph, the green line is what the price of the stock would -be if people had perfect foresight about the path of dividends while the -green line is the conditional expectation $E p_t | y_t, y_{t-1}$, which is what the price would -be if people did not have perfect foresight but were optimally -predicting future dividends on the basis of the information -$y_t, y_{t-1}$ at time $t$. - -## Filtering Foundations - -Assume that $x_0$ is an $n \times 1$ random vector and that -$y_0$ is a $p \times 1$ random vector determined by the -*observation equation* - -$$ -y_0 = G x_0 + v_0 , \quad x_0 \sim {\mathcal N}(\hat x_0, \Sigma_0), \quad v_0 \sim {\mathcal N}(0, R) -$$ +```{code-cell} python3 +multi_normal_factor = MultivariateNormal(μz, Σz) +multi_normal_factor.partition(k) +``` -where $v_0$ is orthogonal to $x_0$, $G$ is a -$p \times n$ matrix, and $R$ is a $p \times p$ -positive definite matrix. +Let’s compute the conditional distribution of the hidden factor +$f$ on the observations $Y$, namely, $f \mid Y=y$. -We consider the problem of someone who *observes* $y_0$, who does -not observe $x_0$, who knows $\hat x_0, \Sigma_0, G, R$ – -and therefore knows the joint probability distribution of the vector -$\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$ – and who wants to -infer $x_0$ from $y_0$ in light of what he knows about that -joint probability distribution. +```{code-cell} python3 +multi_normal_factor.cond_dist(0, y) +``` -Therefore, the person wants to construct the probability distribution of -$x_0$ conditional on the random vector $y_0$. +We can verify that the conditional mean +$E \left[f \mid Y=y\right] = B Y$ where +$B = \Lambda^{\prime} \Sigma_{y}^{-1}$. -The joint distribution of -$\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$ is multivariate normal -${\mathcal N}(\mu, \Sigma)$ with +```{code-cell} python3 +B = Λ.T @ np.linalg.inv(Σy) -$$ -\mu = \begin{bmatrix} \hat x_0 \cr G \hat x_0 \end{bmatrix} , \quad - \Sigma = \begin{bmatrix} \Sigma_0 & \Sigma_0 G' \cr - G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix} -$$ +B @ y +``` -By applying an appropriate instance of the above formulas for the mean vector $\hat \mu_1$ and covariance matrix -$\hat \Sigma_{11}$ of $z_1$ conditional on $z_2$, we find that the probability distribution of -$x_0$ conditional on $y_0$ is -${\mathcal N}(\tilde x_0, \tilde \Sigma_0)$ where +Similarly, we can compute the conditional distribution $Y \mid f$. -$$ -\begin{aligned} \beta_0 & = \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} \cr -\tilde x_0 & = \hat x_0 + \beta_0 ( y_0 - G \hat x_0) \cr - \tilde \Sigma_0 & = \Sigma_0 - \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0 - \end{aligned} -$$ +```{code-cell} python3 +multi_normal_factor.cond_dist(1, f) +``` -### Step toward dynamics +It can be verified that the mean is +$\Lambda I^{-1} f = \Lambda f$. 
-Now suppose that we are in a time series setting and that we have the -one-step state transition equation +```{code-cell} python3 +Λ @ f +``` -$$ -x_1 = A x_0 + C w_1 , \quad w_1 \sim {\mathcal N}(0, I ) -$$ +## PCA and Factor Analysis -where $A$ is an $n \times n$ matrix and $C$ is an -$n \times m$ matrix. +To learn about Principal Components Analysis (PCA), please see this lecture {doc}`Singular Value Decompositions `. -It follows that the probability distribution of $x_1$ conditional -on $y_0$ is +For fun, let’s apply a PCA decomposition +to a covariance matrix $\Sigma_y$ that in fact is governed by our factor-analytic +model. -$$ -x_1 | y_0 \sim {\mathcal N}(A \tilde x_0 , A \tilde \Sigma_0 A' + C C' ) -$$ -Define -$$ -\begin{aligned} \hat x_1 & = A \tilde x_0 \cr - \Sigma_1 & = A \tilde \Sigma_0 A' + C C' -\end{aligned} -$$ +Technically, this means that the PCA model is misspecified. (Can you +explain why?) -### Dynamic version +Nevertheless, this exercise will let us study how well the first two +principal components from a PCA can approximate the conditional +expectations $E f_i | Y$ for our two factors $f_i$, +$i=1,2$ for the factor analytic model that we have assumed truly +governs the data on $Y$ we have generated. -Suppose now that for $t \geq 0$, -$\{x_{t+1}, y_t\}_{t=0}^\infty$ are governed by the equations +So we compute the PCA decomposition $$ -\begin{aligned} -x_{t+1} & = A x_t + C w_{t+1} \cr -y_t & = G x_t + v_t -\end{aligned} +\Sigma_{y} = P \tilde{\Lambda} P^{\prime} $$ -where as before $x_0 \sim {\mathcal N}(\hat x_0, \Sigma_0)$, -$w_{t+1}$ is the $t+1$th component of an i.i.d. stochastic -process distributed as $w_{t+1} \sim {\mathcal N}(0, I)$, and -$v_t$ is the $t$th component of an i.i.d. process -distributed as $v_t \sim {\mathcal N}(0, R)$ and the -$\{w_{t+1}\}_{t=0}^\infty$ and $\{v_t\}_{t=0}^\infty$ -processes are orthogonal at all pairs of dates. +where $\tilde{\Lambda}$ is a diagonal matrix. -The logic and -formulas that we applied above imply that the probability distribution -of $x_t$ conditional on -$y_0, y_1, \ldots , y_{t-1} = y^{t-1}$ is +We have $$ -x_t | y^{t-1} \sim {\mathcal N}(A \tilde x_t , A \tilde \Sigma_t A' + C C' ) +Y = P \epsilon $$ -where $\{\tilde x_t, \tilde \Sigma_t\}_{t=1}^\infty$ can be -computed by iterating on the following equations starting from -$t=1$ and initial conditions for -$\tilde x_0, \tilde \Sigma_0$ computed as we have above: +and $$ -\begin{aligned} \Sigma_t & = A \tilde \Sigma_{t-1} A' + C C' \cr - \hat x_t & = A \tilde x_{t-1} \cr -\beta_t & = \Sigma_t G' (G \Sigma_t G' + R)^{-1} \cr -\tilde x_t & = \hat x_t + \beta_t ( y_t - G \hat x_t) \cr - \tilde \Sigma_t & = \Sigma_t - \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t - \end{aligned} +\epsilon = P^\prime Y $$ -We can use the Python class *MultivariateNormal* to construct examples. - -Here is an example for a single period problem at time $0$ +Note that we will arrange the eigenvectors in $P$ in the +*descending* order of eigenvalues. 
```{code-cell} python3 -G = np.array([[1., 3.]]) -R = np.array([[1.]]) +𝜆_tilde, P = np.linalg.eigh(Σy) -x0_hat = np.array([0., 1.]) -Σ0 = np.array([[1., .5], [.3, 2.]]) +# arrange the eigenvectors by eigenvalues +ind = sorted(range(N), key=lambda x: 𝜆_tilde[x], reverse=True) -μ = np.hstack([x0_hat, G @ x0_hat]) -Σ = np.block([[Σ0, Σ0 @ G.T], [G @ Σ0, G @ Σ0 @ G.T + R]]) +P = P[:, ind] +𝜆_tilde = 𝜆_tilde[ind] +Λ_tilde = np.diag(𝜆_tilde) + +print('𝜆_tilde =', 𝜆_tilde) ``` ```{code-cell} python3 -# construction of the multivariate normal instance -multi_normal = MultivariateNormal(μ, Σ) +# verify the orthogonality of eigenvectors +np.abs(P @ P.T - np.eye(N)).max() ``` ```{code-cell} python3 -multi_normal.partition(2) +# verify the eigenvalue decomposition is correct +P @ Λ_tilde @ P.T ``` ```{code-cell} python3 -# the observation of y -y0 = 2.3 +ε = P.T @ y -# conditional distribution of x0 -μ1_hat, Σ11 = multi_normal.cond_dist(0, y0) -μ1_hat, Σ11 +print("ε = ", ε) ``` ```{code-cell} python3 -A = np.array([[0.5, 0.2], [-0.1, 0.3]]) -C = np.array([[2.], [1.]]) +# print the values of the two factors -# conditional distribution of x1 -x1_cond = A @ μ1_hat -Σ1_cond = C @ C.T + A @ Σ11 @ A.T -x1_cond, Σ1_cond +print('f = ', f) ``` -### Code for Iterating +Below we’ll plot several things -Here is code for solving a dynamic filtering problem by iterating on our -equations, followed by an example. +- the $N$ values of $y$ +- the $N$ values of the principal components $\epsilon$ +- the value of the first factor $f_1$ plotted only for the first + $N/2$ observations of $y$ for which it receives a + non-zero loading in $\Lambda$ +- the value of the second factor $f_2$ plotted only for the final + $N/2$ observations for which it receives a non-zero loading in + $\Lambda$ ```{code-cell} python3 -def iterate(x0_hat, Σ0, A, C, G, R, y_seq): +plt.scatter(range(N), y, label='y') +plt.scatter(range(N), ε, label='$\epsilon$') +plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$') +plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$') +plt.legend() - p, n = G.shape +plt.show() +``` - T = len(y_seq) - x_hat_seq = np.empty((T+1, n)) - Σ_hat_seq = np.empty((T+1, n, n)) +Consequently, the first two $\epsilon_{j}$ correspond to the +largest two eigenvalues. - x_hat_seq[0] = x0_hat - Σ_hat_seq[0] = Σ0 +Let’s look at them, after which we’ll look at $E f | y = B y$ - for t in range(T): - xt_hat = x_hat_seq[t] - Σt = Σ_hat_seq[t] - μ = np.hstack([xt_hat, G @ xt_hat]) - Σ = np.block([[Σt, Σt @ G.T], [G @ Σt, G @ Σt @ G.T + R]]) +```{code-cell} python3 +ε[:2] +``` - # filtering - multi_normal = MultivariateNormal(μ, Σ) - multi_normal.partition(n) - x_tilde, Σ_tilde = multi_normal.cond_dist(0, y_seq[t]) +```{code-cell} python3 +# compare with Ef|y +B @ y +``` - # forecasting - x_hat_seq[t+1] = A @ x_tilde - Σ_hat_seq[t+1] = C @ C.T + A @ Σ_tilde @ A.T +The fraction of variance in $y_{t}$ explained by the first two +principal components can be computed as below. - return x_hat_seq, Σ_hat_seq +```{code-cell} python3 +𝜆_tilde[:2].sum() / 𝜆_tilde.sum() ``` +Compute + +$$ +\hat{Y} = P_{j} \epsilon_{j} + P_{k} \epsilon_{k} +$$ + +where $P_{j}$ and $P_{k}$ correspond to the largest two +eigenvalues. + ```{code-cell} python3 -iterate(x0_hat, Σ0, A, C, G, R, [2.3, 1.2, 3.2]) +y_hat = P[:, :2] @ ε[:2] ``` -The iterative algorithm just described is a version of the celebrated **Kalman filter**. 
+In this example, it turns out that the projection $\hat{Y}$ of +$Y$ on the first two principal components does a good job of +approximating $Ef \mid y$. -We describe the Kalman filter and some applications of it in {doc}`A First Look at the Kalman Filter ` +We confirm this in the following plot of $f$, +$E y \mid f$, $E f \mid y$, and $\hat{y}$ on the +coordinate axis versus $y$ on the ordinate axis. + +```{code-cell} python3 +plt.scatter(range(N), Λ @ f, label='$Ey|f$') +plt.scatter(range(N), y_hat, label='$\hat{y}$') +plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$') +plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$') + +Efy = B @ y +plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label='$Ef_{1}|y$') +plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label='$Ef_{2}|y$') +plt.legend() + +plt.show() +``` + +The covariance matrix of $\hat{Y}$ can be computed by first +constructing the covariance matrix of $\epsilon$ and then use the +upper left block for $\epsilon_{1}$ and $\epsilon_{2}$. + +```{code-cell} python3 +Σεjk = (P.T @ Σy @ P)[:2, :2] + +Pjk = P[:, :2] + +Σy_hat = Pjk @ Σεjk @ Pjk.T +print('Σy_hat = \n', Σy_hat) +```