Probability
The Analysis of Data, volume 1
Random Vectors: Conditional Probabilities and Random Vectors
$
\def\P{\mathsf{\sf P}}
\def\E{\mathsf{\sf E}}
\def\Var{\mathsf{\sf Var}}
\def\Cov{\mathsf{\sf Cov}}
\def\std{\mathsf{\sf std}}
\def\Cor{\mathsf{\sf Cor}}
\def\R{\mathbb{R}}
\def\c{\,|\,}
\def\bb{\boldsymbol}
\def\diag{\mathsf{\sf diag}}
\def\defeq{\stackrel{\tiny\text{def}}{=}}
$
4.5. Conditional Probabilities and Random Vectors
Conditional probabilities for random vectors are defined similarly to the scalar case. Considering a joint distribution over the random vector $\bb{Z}=(\bb{X},\bb{Y})$, the conditional probability $\P(\bb X\in A \c \bb Y=\bb y)$ reflects an updated likelihood for the event $\bb X\in A$ given that $\bb Y=\bb y$.
The conditional cdf, pdf, and pmf are defined as follows
\begin{align}
F_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) &= \begin{cases}
\P(\bb{X} \leq \bb{x}, \bb{Y}=\bb{y}) / p_{\bb{Y}}(\bb{y}) & \bb{Y} \text{ is discrete} \\
\P(\bb{X} \leq \bb{x}, \bb{Y}=\bb{y}) / f_{\bb{Y}}(\bb{y}) & \bb{Y} \text{ is continuous} \end{cases} \\
f_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) &= \frac{ \partial^n}{\partial x_1\cdots\partial x_n} F_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) \\
p_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) &= \frac{\P(\bb{X} = \bb{x}, \bb{Y}=\bb{y})}{\P(\bb{Y}=\bb{y})} = \frac{p_{\bb{X},\bb{Y}}((\bb{x},\bb{y}))}{p_{\bb{Y}}(\bb{y})}.
\end{align}
Note that we assume above that $f_{\bb Y}(\bb y)$ and $p_{\bb Y}(\bb y)$ are not zero.
When both $\bb X$ and $\bb Y$ are continuous, their joint cdf is differentiable and
\begin{align*}
f_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) &= \frac{f_{\bb{X},\bb{Y}}(\bb x,\bb{y})}{f_{\bb Y}(\bb y)}.
\end{align*}
Computing conditional probabilities from the conditional pdf or pmf proceeds as in the unconditional case: integrate the conditional pdf, or sum the conditional pmf, over the event $A$. The proof is similar to the scalar case (see Chapter 2).
\begin{align*}
\P(\bb{X}\in A \c \bb{Y}=\bb{y})=
\begin{cases}
\int_A f_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x})d\bb{x} & \bb{X} \text{ is continuous}\\
\sum_{\bb{x}\in A} p_{\bb{X} \c \bb{Y}=\bb{y}}(\bb{x}) & \bb{X} \text{ is discrete}
\end{cases}.
\end{align*}
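As a quick numerical illustration (not part of the text), the summation formula can be checked on a small joint pmf; the table values below are made up for the example:

```python
import numpy as np

# Hypothetical joint pmf of (X, Y) with X in {0,1,2} (rows) and Y in {0,1} (columns).
p_XY = np.array([[0.10, 0.15],
                 [0.20, 0.25],
                 [0.05, 0.25]])

# Marginal pmf of Y: sum the joint pmf over x.
p_Y = p_XY.sum(axis=0)

# Conditional pmf p_{X|Y=1}(x) = p_{X,Y}(x,1) / p_Y(1).
p_X_given_Y1 = p_XY[:, 1] / p_Y[1]

# P(X in {0,2} | Y=1) is obtained by summing the conditional pmf over the event.
prob = p_X_given_Y1[[0, 2]].sum()
print(p_X_given_Y1, prob)
```

Note that the conditional pmf sums to one, as any pmf must.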
Example 4.5.1. For a random vector $\bb X=(X_1,X_2,X_3)$, we have
\begin{align*}
F_{X_2 \c X_1=x_1,X_3=x_3}(x_2) = \begin{cases}\frac{\P( X_1=x_1, X_2\leq x_2, X_3=x_3)}{\sum_{x_2'} p_{\bb{X}}(x_1, x_2', x_3)}& \bb{X} \text{ is discrete}\\
\frac{\P( X_1=x_1, X_2\leq x_2, X_3=x_3)}{\int f_{\bb{X}}(x_1, x_2',x_3)\,dx_2'} & \bb{X} \text{ is continuous} \end{cases}
\end{align*}
Example 4.5.2. For a continuous random vector $\bb X=(X_1,\ldots,X_n)$, the conditional pdf of $X_i$ given the remaining components is
\begin{align*}
f_{X_i \c \{X_j=x_j:j\neq i\}}(x_i) &= \frac{\frac{d}{dx_i}
\int_{-\infty}^{x_i} f_{X_1,\ldots,X_n}(x_1,\ldots,x_{i-1},t,x_{i+1},\ldots,x_n)\,dt}{
f_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n}(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n)}\\
&=\frac{f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)}{\int_{-\infty}^{\infty}
f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)\,dx_i}.
\end{align*}
In the case of $n=3$, we have
\begin{align*}f_{X_2 \c X_1=x_1,X_3=x_3}(x_2) &=
\frac{d}{d x_2}
F_{X_2 \c X_1=x_1,X_3=x_3}(x_2)\\ &=\frac{f_{X_1,X_2,X_3}(x_1,x_2,x_3)}{
\int_{-\infty}^{\infty}f_{X_1,X_2,X_3}(x_1,x_2,x_3) dx_2}.\end{align*}
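The $n=3$ formula can be sketched numerically. The joint pdf below is a made-up example on the unit cube, and the marginal in the denominator is computed by numerical integration over $x_2$:

```python
import numpy as np

# Hypothetical joint pdf on the unit cube: f(x1,x2,x3) = (2/3)(x1 + x2 + x3).
def f_joint(x1, x2, x3):
    return (2.0 / 3.0) * (x1 + x2 + x3)

x1, x3 = 0.2, 0.7                      # conditioning values
grid = np.linspace(0.0, 1.0, 100001)   # integration grid for x2

# Denominator: f_{X1,X3}(x1,x3) = integral of f(x1,x2,x3) dx2 over [0,1];
# since the interval has length 1, the integral is approximated by the grid average.
marg = f_joint(x1, grid, x3).mean()

# Conditional pdf at a point: joint / marginal.
x2 = 0.3
cond = f_joint(x1, x2, x3) / marg

# Closed form for this particular pdf: (x1+x2+x3) / (x1+x3+1/2).
exact = (x1 + x2 + x3) / (x1 + x3 + 0.5)
print(cond, exact)
```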
The above formulas lead to the following generalization of Bayes' rule for events, $\P(A \c B)=\P(B \c A)\P(A) / \P(B)$ (Proposition 1.5.2).
Proposition 4.5.1 (Bayes Rule).
\begin{align*}
f_{\bb X}(\bb x) &= f_{X_i \c \{X_j=x_j:j\neq
i\}}(x_i)\,
f_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n}(x_1,\ldots,x_{i-1},
x_{i+1}, \ldots, x_n)\\
p_{\bb X}(\bb x) &= p_{X_i \c \{X_j=x_j:j\neq
i\}}(x_i)\,
p_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n}(x_1,\ldots,x_{i-1},
x_{i+1}, \ldots, x_n)
\end{align*}
Proof.
The pdf formula follows from Example 4.5.2. The derivation of the pmf formula is similar.
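The pmf version of the proposition can be verified numerically for an arbitrary discrete joint distribution; the array below is randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf of (X1, X2, X3), each component taking values in {0,1,2}.
p = rng.random((3, 3, 3))
p /= p.sum()

# Marginal of (X1, X3): sum out X2 (axis 1).
p_13 = p.sum(axis=1)

# Conditional pmf p_{X2 | X1=a, X3=c}(b) = p(a,b,c) / p_13(a,c).
a, b, c = 1, 2, 0
cond = p[a, b, c] / p_13[a, c]

# Bayes rule (Proposition 4.5.1): joint = conditional * marginal of the rest.
print(cond * p_13[a, c], p[a, b, c])
```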
Corollary 4.5.1.
\begin{align*}
f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) &= f_{X_1}(x_1)
f_{X_2 \c X_1=x_1}(x_2)f_{X_3 \c X_1=x_1,X_2=x_2}(x_3) \cdots f_{X_n \c X_1=x_1,\ldots,X_{n-1}=x_{n-1}}(x_n).\\
p_{X_1,\ldots,X_n}(x_1,\ldots,x_n) &= p_{X_1}(x_1)
p_{X_2 \c X_1=x_1}(x_2)p_{X_3 \c X_1=x_1,X_2=x_2}(x_3) \cdots p_{X_n \c X_1=x_1,\ldots,X_{n-1}=x_{n-1}}(x_n).
\end{align*}
Proof.
The proof follows by repeated use of Proposition 4.5.1.
The ordering of $X_1,\ldots,X_n$ in the decomposition above is arbitrary, and similar formulas hold when the variables are relabeled: for example, replace $X_1$ with $X_2$, $X_2$ with $X_3$, and $X_3$ with $X_1$, or apply any other relabeling. (Formally, given a permutation $\pi:\{1,\ldots,n\}\to\{1,\ldots,n\}$, that is, a one-to-one and onto function, a relabeling of the vector $(X_1,\ldots,X_n)$ is the vector $(X_{\pi(1)},\ldots,X_{\pi(n)})$.) For example, the following two equations hold.
\begin{align*}
f_{X_1,X_2,X_3}(\bb x) &= f_{X_1} (x_1) f_{X_2 \c X_1=x_1}(x_2) f_{X_3 \c X_1=x_1,X_2=x_2}(x_3)\\
f_{X_1,X_2,X_3}(\bb x) &= f_{X_2} (x_2) f_{X_3 \c X_2=x_2}(x_3) f_{X_1 \c X_2=x_2,X_3=x_3}(x_1).
\end{align*}
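That both orderings yield the same joint distribution can be checked numerically in the discrete case; the joint pmf below is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint pmf of (X1, X2, X3) on {0,1} x {0,1} x {0,1}.
p = rng.random((2, 2, 2))
p /= p.sum()
a, b, c = 0, 1, 1   # the point (x1, x2, x3)

# Ordering X1, X2, X3: p(x1) p(x2|x1) p(x3|x1,x2).
d1 = p.sum(axis=(1, 2))[a] \
     * (p.sum(axis=2)[a, b] / p.sum(axis=(1, 2))[a]) \
     * (p[a, b, c] / p.sum(axis=2)[a, b])

# Ordering X2, X3, X1: p(x2) p(x3|x2) p(x1|x2,x3).
d2 = p.sum(axis=(0, 2))[b] \
     * (p.sum(axis=0)[b, c] / p.sum(axis=(0, 2))[b]) \
     * (p[a, b, c] / p.sum(axis=0)[b, c])

print(d1, d2, p[a, b, c])
```

Both products telescope back to the joint pmf value, which is why the ordering does not matter.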
Example 4.5.3.
Suppose that a point $X$ is chosen from a uniform
distribution in the interval $[0,1]$ and that after $X=x$ is
observed a point $Y$ is drawn from a uniform distribution on the
interval $[x,1]$. We have
\begin{align*}
f_{X,Y}(x,y) &= f_X(x)f_{Y \c X=x}(y)=
\begin{cases}
1\cdot \frac{1}{1-x} & 0 < x < y < 1 \\ 0 & \text{otherwise}
\end{cases}\\
f_Y(y)&=
\begin{cases} \int_{-\infty}^{\infty}f_{X,Y}(x,y)
dx=\int_0^y\frac{1}{1-x}dx=-\log(1-y) & 0 < y < 1 \\
0 & \text{otherwise}
\end{cases}\\
f_{X \c Y=y}(x)&=f_{X,Y}(x,y)/f_{Y}(y)=
\begin{cases} \frac{-1}{(1-x)\log(1-y)} & 0 < x < y < 1\\ 0 &\text{otherwise}
\end{cases}.
\end{align*}
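The marginal $f_Y$ above can be checked by simulation. Integrating $f_Y(y)=-\log(1-y)$ gives the closed form $\P(Y\leq t)=t+(1-t)\log(1-t)$, and a Monte Carlo estimate (an illustration, not from the text) agrees:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate the example: X ~ U(0,1), then Y | X=x ~ U(x,1).
n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(x, 1.0)          # lower bound varies per sample (broadcasts)

# Monte Carlo estimate of P(Y <= t) versus the closed form
# F_Y(t) = integral_0^t -log(1-y) dy = t + (1-t) log(1-t).
t = 0.5
mc = (y <= t).mean()
exact = t + (1 - t) * np.log(1 - t)
print(mc, exact)
```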