Probability
The Analysis of Data, volume 1
Random Vectors: Moments
$
\def\P{\mathsf{\sf P}}
\def\E{\mathsf{\sf E}}
\def\Var{\mathsf{\sf Var}}
\def\Cov{\mathsf{\sf Cov}}
\def\std{\mathsf{\sf std}}
\def\Cor{\mathsf{\sf Cor}}
\def\trace{\mathsf{\sf trace}}
\def\R{\mathbb{R}}
\def\c{\,|\,}
\def\bb{\boldsymbol}
\def\diag{\mathsf{\sf diag}}
\def\defeq{\stackrel{\tiny\text{def}}{=}}
$
4.6. Moments
Definition 4.6.1.
The expectation of a random vector $\bb X=(X_1,\ldots,X_n)$ is the vector of expectations of the corresponding random variables
\[ \E(\bb X) \defeq (\E(X_1),\ldots,\E(X_n)) \in \R^n.\]
In some cases we arrange the components of a random vector $\bb X=(X_1,\ldots,X_n)$ in matrix form. In this case the expectation is defined similarly as a matrix whose entries are the expectations of the corresponding RVs. It is common to use vector or matrix equations when dealing with random vectors or matrices in these cases. For example,
\[ 1+\E\begin{pmatrix}X_1 & X_2\\X_3 & X_4 \end{pmatrix} = \begin{pmatrix}1+\E X_1 & 1+\E X_2 \\1+\E X_3 &1+ \E X_4 \end{pmatrix}.\]
The following example shows that the linearity property of expectation from Chapter 2 extends to random vectors and random matrices. It may be verified by a straightforward application of the linearity of expectation (see Chapter 2) together with the rules of vector and matrix addition and multiplication.
Example 4.6.1.
For a vector of RVs $\bb X$ we have the following vector equality
\[\E(a\bb X+\bb b)=a\E(\bb X)+\bb b.\]
For a matrix of RVs $\bb Y$, we have the following matrix equality
\[\E(A\bb Y B+C)=A\E(\bb Y)B+C.\]
Above, we assume that $\bb b$ is a constant vector, $a$ is a constant scalar, and $A,B,C$ are constant matrices.
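As a quick numerical illustration of Example 4.6.1 (this sketch is not part of the text), the following Python/NumPy code estimates $\E(a\bb X+\bb b)$ by Monte Carlo and compares it with $a\E(\bb X)+\bb b$ computed componentwise; the distribution of $\bb X$ and the constants $a$, $\bb b$ are arbitrary illustrative choices.

import numpy as np

# Monte Carlo sketch (illustrative choices): draw many samples of X, apply the
# affine map a*X + b, and compare the empirical mean of a*X + b with
# a * (empirical mean of X) + b, componentwise.
rng = np.random.default_rng(0)
n_samples = 200_000

# X = (X1, X2, X3) with independent components of means (1, 2, 3)
X = rng.normal(loc=[1.0, 2.0, 3.0], scale=1.0, size=(n_samples, 3))

a = 2.0
b = np.array([10.0, 20.0, 30.0])

lhs = (a * X + b).mean(axis=0)   # empirical E(aX + b)
rhs = a * X.mean(axis=0) + b     # a * empirical E(X) + b

print(lhs)   # both approximately [12, 24, 36]
print(rhs)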
The following proposition generalizes Proposition 2.3.1.
Proposition 4.6.1.
For a random vector $\bb X=(X_1,\ldots,X_n)$ and a function $g:\R^n\to\R$,
\begin{align*}
\E(g(\bb X))=
\begin{cases}
\int_{\R^n} g(\bb{x}) f_{\bb{X}}(\bb{x}) \,
d\bb{x} & \bb{X} \text{ is continuous}\\
\sum_{\bb{x}\in\R^n} g(\bb{x})
p_{\bb{X}}(\bb{x}) & \bb{X} \text{ is discrete}
\end{cases}.
\end{align*}
Proof.
The proof is identical to the proof of Proposition 2.3.1.
If $g(\bb X)$ is a function of only a subset of the component RVs $X_1,\ldots,X_n$ (for example $g(\bb X)=(X_1+X_3)/2$), the expression in Proposition 4.6.1 may be simplified by taking the integration or summation only with respect to the components of $\bb{X}$ present in $g(\bb{X})$. In this case $p_{\bb{X}}$ or $f_{\bb{X}}$ may be replaced with the joint pmf or pdf over the reduced set of variables. The following two examples demonstrate this principle and justify its application.
Example 4.6.2.
For a continuous random vector $\bb{X}=(X_1,X_2,X_3)$ and $g(\bb{X})=X_1/X_2$,
\begin{align*}
\E(X_1 / X_2) &= \iiint f_{X_1,X_2,X_3}(x_1,x_2,x_3)\, x_1/x_2 \, dx_1 dx_2 dx_3\\ &= \iiint f_{X_3 \c X_1=x_1,X_2=x_2}(x_3) f_{X_1,X_2}(x_1,x_2)\, x_1/x_2 \, dx_1 dx_2 dx_3\\
&= \iint \left(\int f_{X_3 \c X_1=x_1,X_2=x_2}(x_3) \, dx_3\right) f_{X_1,X_2}(x_1,x_2)\, x_1/x_2 \, dx_1 dx_2 \\
&= 1 \cdot \iint f_{X_1,X_2}(x_1,x_2)\, x_1/x_2 \, dx_1 dx_2.
\end{align*}
A similar expression holds in the discrete case (replace pdfs with pmfs and integrals with sums).
Example 4.6.3.
Taking expectation of $X_j$ with respect to the marginal $f_{X_j}$ or with respect to the joint $f_{\bb X}$ produces identical results:
\begin{align*}
&\E(X_j)\\
&=\int\cdots\int x_j f_{\bb X}(x_1,\ldots,x_n)\,dx_1\cdots dx_n\\
&= \int x_j f_{X_j}(x_j) \, dx_j \int \cdots \int f_{\{X_i:i\neq j\} \c X_j=x_j}(\{x_i:i\neq j\}) \,dx_1\cdots dx_{j-1} dx_{j+1}\cdots dx_n \\ &=\int x_j f_{X_j}(x_j) \, dx_j \cdot 1.
\end{align*}
A similar expression holds in the discrete case (replace pdfs with pmfs and integrals with sums).
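The following short Python sketch (not from the text; the joint pmf below is made up) illustrates Example 4.6.3 in the discrete case: computing $\E(X_1)$ by summing over the full joint pmf gives the same value as summing over the marginal pmf of $X_1$.

import numpy as np

# made-up joint pmf of a discrete pair (X1, X2); rows index x1 values,
# columns index x2 values, and the entries sum to 1
x1_vals = np.array([0.0, 1.0, 2.0])
p_joint = np.array([[0.10, 0.20],
                    [0.30, 0.15],
                    [0.05, 0.20]])

# expectation of X1 computed with respect to the joint pmf
E_joint = np.sum(x1_vals[:, None] * p_joint)

# expectation of X1 computed with respect to the marginal pmf of X1
p_X1 = p_joint.sum(axis=1)
E_marginal = np.sum(x1_vals * p_X1)

print(E_joint, E_marginal)   # identical: 0.95 and 0.95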
Definition 4.6.2.
The covariance of two random variables $X,Y$ is
\[\Cov(X,Y)\defeq \E((X-\E(X))(Y-\E(Y))).\]
The covariance matrix of two random vectors $\bb{X}=(X_1,\ldots,X_n)$, $\bb{Y}=(Y_1,\ldots,Y_m)$ is the $n\times m$ matrix defined by
\begin{align*}
[\Cov(\bb{X},\bb{Y})]_{ij} &= \Cov(X_i,Y_j),
\end{align*}
or using matrix notation (assuming $\bb X,\bb Y$ are column vectors)
\[\Cov(\bb{X},\bb{Y}) = \E((\bb{X}-\E(\bb{X}))(\bb{Y}-\E(\bb{Y}))^{\top})\]
(where $A^{\top}$ is the transpose of $A$). The variance matrix of the random vector $\bb{X}=(X_1,\ldots,X_n)$ is the $n\times n$ matrix defined by
\begin{align*}
[\Var(\bb{X})]_{ij} &= \Cov(X_i,X_j),
\end{align*}
or using matrix notation (assuming $\bb X$ is a column vector)
\[\Var(\bb{X}) = \Cov(\bb{X},\bb{X}) = \E((\bb{X}-\E(\bb{X}))(\bb{X}-\E(\bb{X}))^{\top}).\]
Note that the variance of a random variable $\Var(X)$ is in agreement with the $1\times 1$ variance matrix of a random vector with one component $\bb X=(X)$.
Proposition 4.6.2.
\begin{align*}
\Cov(X_i,X_j)&=\E(X_iX_j)-\E(X_i)\E(X_j) \\
\Cov(\bb{X},\bb{Y}) &= \E(\bb{X}\bb{Y}^{\top}) - \E(\bb{X}) \E(\bb{Y})^{\top}.
\end{align*}
Proof.
The following derivation proves the first statement
\begin{align*}
\Cov(X_i,X_j)&=\E(X_i X_j-X_i \E(X_j)-X_j \E(X_i)+\E(X_i)\E(X_j))\\ &=\E(X_iX_j)-2\E(X_i)\E(X_j)+\E(X_i)\E(X_j)\\ &=\E(X_iX_j)-\E(X_i)\E(X_j).
\end{align*}
The second statement follows by applying the first statement to each element of the relevant matrices.
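A numerical sanity check of Proposition 4.6.2 (a sketch with made-up data, not part of the text): for sampled vectors $\bb X\in\R^2$ and $\bb Y\in\R^3$, the empirical version of $\E(\bb X\bb Y^{\top})-\E(\bb X)\E(\bb Y)^{\top}$ matches the matrix of entry-wise sample covariances.

import numpy as np

rng = np.random.default_rng(1)
n_samples = 200_000

# build correlated X (2-dim) and Y (3-dim) from a common source Z
Z = rng.standard_normal((n_samples, 3))
A_mix = np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0]])
X = Z[:, :2]
Y = Z @ A_mix.T

# matrix form: empirical E(X Y^T) - E(X) E(Y)^T
EXYt = X.T @ Y / n_samples
cov_matrix = EXYt - np.outer(X.mean(axis=0), Y.mean(axis=0))

# entry-wise form: sample Cov(X_i, Y_j) for each i, j
cov_entries = np.array([[np.cov(X[:, i], Y[:, j], bias=True)[0, 1]
                         for j in range(3)] for i in range(2)])

print(np.allclose(cov_matrix, cov_entries, atol=1e-8))   # True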
Proposition 4.6.3
(Linearity Property of Expectation)
\[ \E\left(\sum_{i=1}^n a_iX_i\right)=\sum_{i=1}^n a_i\E (X_i)\]
Proof.
Using the argument in Example 4.6.3 we have
\begin{align*}
\E(aX+bY)&=\iint_{\R^2} (ax+by)f_{X,Y}(x,y)\,
dx\,dy\\ &=\int_{-\infty}^{\infty} ax \int_{-\infty}^{\infty}
f_{X,Y}(x,y)\,dy\,dx +\int_{-\infty}^{\infty} by\int_{-\infty}^{\infty}
f_{X,Y}(x,y)\,dx\,dy\\ &=a\E(X)+b\E(Y).
\end{align*}
(If $X,Y$ are discrete replace integrals with sums and
pdf functions with pmf functions). Using induction completes the proof.
We emphasize that the linearity of expectation in Proposition 4.6.3 above holds for any random vector $\bb X$, including the case where the variables $X_i$ are dependent.
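The following Monte Carlo sketch (illustration only; the construction of the dependent components is arbitrary) shows Proposition 4.6.3 at work when the $X_i$ are strongly dependent.

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.exponential(scale=1.0, size=1_000_000)   # E(X1) = 1
X2 = X1 ** 2                                      # E(X2) = 2 for Exp(1)
X3 = -3.0 * X1                                    # E(X3) = -3

# E(2 X1 + X2 + 5 X3) versus 2 E(X1) + E(X2) + 5 E(X3)
lhs = np.mean(2.0 * X1 + X2 + 5.0 * X3)
rhs = 2.0 * X1.mean() + X2.mean() + 5.0 * X3.mean()

print(lhs, rhs)   # both approximately 2*1 + 2 + 5*(-3) = -11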
Proposition 4.6.4
For a random vector $\bb{X}$ with independent components and any functions $g_i:\mathbb{R}\to\mathbb{R}$, $i=1,\ldots,n$ we have
\begin{align*}
\E\left(\prod_{i=1}^n g_i(X_i)\right) = \prod_{i=1}^n \E(g_i(X_i)).
\end{align*}
In particular for independent $X,Y$ we have $\E(XY)=\E(X)\E(Y)$.
Proof.
\begin{align*}
\E(g_1(X)g_2(Y))&=\iint_{\R^2} g_1(x)g_2(y)f_{X,Y}(x,y)\,
dxdy\\ &=\iint_{\R^2} g_1(x)g_2(y)f_{X}(x)f_Y(y)\, dxdy\\
&=\left(\int_{-\infty}^{\infty} g_1(x)f_X(x)dx\right)
\left(\int_{-\infty}^{\infty} g_2(y)f_Y(y)dy\right)
\\ &=\E(g_1(X))\E(g_2(Y)).
\end{align*}
The proof for discrete RVs is similar (replace
integrals with sums and pdf functions with pmf functions). Using induction completes the proof.
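As a small numerical check of Proposition 4.6.4 (a sketch; the distributions and the functions $g_1(x)=x^2$, $g_2(y)=e^y$ are arbitrary choices), independence lets the expectation of the product factor into the product of expectations.

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal(1_000_000)
Y = rng.uniform(0.0, 1.0, size=1_000_000)   # generated independently of X

lhs = np.mean(X**2 * np.exp(Y))             # E(g1(X) g2(Y))
rhs = np.mean(X**2) * np.mean(np.exp(Y))    # E(g1(X)) E(g2(Y))

print(lhs, rhs)   # both approximately 1 * (e - 1), about 1.718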
Corollary 4.6.1
If $\bb{X}$ is an independent random vector, then $\Var(\bb{X})$ is a diagonal matrix, or in other words $i\neq j$ implies $\Cov(X_i,X_j)=0$.
Proof.
According to Proposition 4.6.2, \[\Cov(X_i,X_j)=\E(X_iX_j)-\E(X_i)\E(X_j),\]
which by Proposition 4.6.4 is zero for independent $X_i,X_j$.
While independence implies zero covariance, the converse is not necessarily true: two dependent RVs may have zero or non-zero covariance. Intuitively, $\Cov(X,Y)$ measures the extent to which there exists a linear relationship between $X$ and $Y$: $X=\alpha Y+\beta$, $\alpha,\beta\in\R$. If there is no linear relationship, the covariance is zero but the variables may still be dependent. The bottom row of the six samples in Section 4.1.1 displays examples of dependent RVs with zero covariance.
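The following simulation sketch (a standard textbook-style construction, not the examples from Section 4.1.1) shows two dependent RVs with zero covariance: $X\sim N(0,1)$ and $Y=X^2$.

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal(2_000_000)
Y = X ** 2                                  # completely determined by X

print(np.cov(X, Y)[0, 1])                   # approximately 0
# the dependence is still visible, e.g. through conditioning on |X| > 2:
print(Y[np.abs(X) > 2].mean())              # much larger than the overall mean
print(Y.mean())                             # approximately 1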
Proposition 4.6.5
\begin{align*}
\Var\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n\sum_{j=1}^n \Cov(X_i,X_j)=\sum_{i=1}^n \Var(X_i) + 2\sum_{i=1}^n\sum_{j>i} \Cov(X_i,X_j).
\end{align*}
In particular for any two RVs $X,Y$,
\begin{align*}
\Var(X+Y)&=\Var(X)+\Var(Y)+2\Cov(X,Y)\\
\Var(X-Y)&=\Var(X)+\Var(Y)-2\Cov(X,Y).
\end{align*}
Proof.
For two RVs,
\begin{align*}
&\Var(X+Y)\\ &=\E(((X+Y)-\E(X+Y))^2)\\
&=\E(X^2+Y^2+2XY+(\E(X))^2+(\E(Y))^2+2\E(X)\E(Y)-2(X+Y)\E(X+Y))\\
&=\E(X^2)+\E(Y^2)+2\E(XY)+(\E(X))^2+(\E(Y))^2+2\E(X)\E(Y)-2(\E(X)+\E(Y))^2\\
&= \E(X^2) - (\E(X))^2 + \E(Y^2) - (\E(Y))^2
+ 2 \E(XY)-2\E(X)\E(Y)\\
&= \Var(X)+\Var(Y)+2\Cov(X,Y).
\end{align*}
The case of $X-Y$ follows by writing it as $X+(-Y)$ and noting that $\Var(-Y)=\Var(Y)$ while $\Cov(X,-Y)=-\Cov(X,Y)$. The general case follows by induction.
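A numerical sketch of Proposition 4.6.5 for two correlated RVs (the construction of $X,Y$ below is an arbitrary illustration):

import numpy as np

rng = np.random.default_rng(5)
Z1, Z2 = rng.standard_normal((2, 1_000_000))
X = Z1
Y = 0.8 * Z1 + 0.6 * Z2                     # correlated with X

var_x, var_y = X.var(), Y.var()
cov_xy = np.cov(X, Y, bias=True)[0, 1]

print((X + Y).var(), var_x + var_y + 2 * cov_xy)   # the two numbers agree
print((X - Y).var(), var_x + var_y - 2 * cov_xy)   # the two numbers agree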
Corollary 4.6.2
If $\bb{X}=(X_1,\ldots,X_n)$ is an independent random vector
\[\Var\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \Var(X_i).\]
Proof.
The result follows from Proposition 4.6.5 and the fact that independent RVs have zero covariance (Corollary 4.6.1).
Proposition 4.6.6
\[\E(\bb{X}^{\top}A\bb{X})=\trace (A \Var(\bb X))+(\E(\bb X))^{\top}A\E(\bb X).\]
Proof.
Denoting $\E(\bb X) = \bb{\mu}$ and $\Var(\bb X)=\Sigma$, and using the facts that a scalar equals its trace, that trace and expectation commute (by linearity), that $\trace(AB)=\trace(BA)$, and that $\E(\bb X\bb X^{\top})=\Sigma+\bb{\mu}\bb{\mu}^{\top}$ (Proposition 4.6.2 with $\bb Y=\bb X$),
\begin{align*}
\E(\bb{X}^{\top}A\bb{X})&=\trace(\E(\bb{X}^{\top}A\bb{X}))\\ &=\E(\trace(\bb{X}^{\top}A\bb{X}))\\
&=\E(\trace(A\bb{X}\bb{X}^{\top}))\\
&=\trace(\E(A\bb{X}\bb{X}^{\top}))\\
&=\trace(A\E(\bb{X}\bb{X}^{\top}))\\
&=\trace(A(\Var(\bb{X})+\bb{\mu}\bb{\mu}^{\top}))\\
&=\trace(A\Sigma)+\trace(A\bb{\mu}\bb{\mu}^{\top})\\
&=\trace(A\Sigma)+\bb{\mu}^{\top}A\bb{\mu}.
\end{align*}
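The identity in Proposition 4.6.6 is easy to check by simulation. In the sketch below (the mean vector, covariance matrix, and the matrix $A$ are arbitrary illustrative choices) a Monte Carlo estimate of $\E(\bb X^{\top}A\bb X)$ is compared with $\trace(A\Sigma)+\bb\mu^{\top}A\bb\mu$.

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

# Monte Carlo estimate of E(X^T A X): one quadratic form per sample row
lhs = np.einsum('ni,ij,nj->n', X, A, X).mean()

# closed form: trace(A Sigma) + mu^T A mu
rhs = np.trace(A @ Sigma) + mu @ A @ mu

print(lhs, rhs)   # approximately equal (rhs = 7.25)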
Proposition 4.6.7
\[ \Cov(A\bb{X},B\bb{Y}) =A \Cov(\bb{X},\bb{Y})B^{\top}\]
Proof.
Using the linearity property of expectation (see Example 4.6.1),
\begin{align*}
\Cov(A\bb{X},B\bb{Y}) &= \E((A\bb{X}-A\E(\bb X))(B\bb{Y}-B\E(\bb Y))^{\top})\\
&=\E(A(\bb{X}-\E(\bb X))(\bb{Y}-\E(\bb Y))^{\top}B^{\top})
\\ &=A\,\E((\bb{X}-\E(\bb X))(\bb{Y}-\E(\bb Y))^{\top})\,B^{\top}\\
&=A \Cov(\bb{X},\bb{Y})B^{\top}.
\end{align*}
Corollary 4.6.3
For any matrix $A$ and random vector $\bb{X}$
\begin{align*}
\Var(A\bb{X})=A\Var(\bb{X})A^{\top}.
\end{align*}
Proof.
Apply Proposition 4.6.7 with $\bb{Y}=\bb{X}$ and $B=A$.
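The following sketch checks Corollary 4.6.3 numerically (the distribution of $\bb X$ and the matrix $A$ are arbitrary choices): the sample variance matrix of $A\bb X$ agrees with $A$ times the sample variance matrix of $\bb X$ times $A^{\top}$.

import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.0],
                  [0.2, 0.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0, 0.0], Sigma, size=500_000)
A = np.array([[1.0, -1.0, 0.0],
              [2.0,  0.0, 3.0]])

var_AX = np.cov((X @ A.T).T, bias=True)         # variance matrix of AX (2x2)
A_varX_At = A @ np.cov(X.T, bias=True) @ A.T    # A Var(X) A^T

print(np.allclose(var_AX, A_varX_At))           # True (up to floating point)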
Proposition 4.6.8
The matrix $\Var(\bb{X})$ is positive semi-definite. It is positive definite if no component of $\bb{X}$ is a linear combination of the other components.
Proof.
Recall that the variance is an expectation of a non-negative (squared) random variable. Using Corollary 4.6.3 we have for all column vectors $\bb v$,
\begin{align*}
0 \leq \Var(\bb{v}^{\top}\bb{X})=\bb{v}^{\top}\Var(\bb{X})\bb{v},
\end{align*}
which shows that the variance matrix is positive semi-definite. The inequality above holds with equality if and only if $\bb{v}$ is the zero vector or $\bb{v}^{\top}\bb{X}$ is a deterministic RV taking a constant value with probability 1. This can only happen if one component of $\bb X$ is a linear combination of the other components.
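A quick numerical illustration of Proposition 4.6.8 (made-up data): the eigenvalues of a sample variance matrix are non-negative, and appending a component that is a linear combination of the others makes the smallest eigenvalue (essentially) zero, so the matrix is only positive semi-definite.

import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((100_000, 3))
X_degenerate = np.column_stack([X, X[:, 0] + 2.0 * X[:, 1]])   # X4 = X1 + 2 X2

print(np.linalg.eigvalsh(np.cov(X.T)))              # all strictly positive
print(np.linalg.eigvalsh(np.cov(X_degenerate.T)))   # smallest is (numerically) zero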
Definition 4.6.3.
The correlation coefficient of two RVs $X,Y$ is defined as
\[\Cor(X,Y)=\frac{\Cov(X,Y)}{\sqrt{\Var(X)}\sqrt{\Var(Y)}}.\]
Proposition 4.6.9
For any random variables $X,Y$ we have
\[-1\leq \Cor(X,Y)\leq 1,\]
where equality in one of the inequalities above holds if and only if there is a linear relationship between $X$ and $Y$: $Y=aX+b$ or $X=aY+b$ for some $a,b\in\R$ with $a\neq 0$.
Proof.
Since the expectation of a non-negative RV is non-negative,
\begin{align*}
0&\leq \E\left(\left(\frac{X-\E(X)}{\sqrt{\Var(X)}} \pm
\frac{Y-\E(Y)}{\sqrt{\Var(Y)}} \right)^2 \right) \\ &=
\frac{\E((X-\E(X))^2)}{\Var(X)}+\frac{\E((Y-\E(Y))^2)}{\Var(Y)} \pm
2\Cor(X,Y)\\ &=2(1\pm\Cor(X,Y)).
\end{align*}
(The notation $\pm$ above means that the derivation may be repeated once with a plus sign and once with a minus sign.) This gives $2(1\pm\Cor(X,Y))\geq 0$, i.e., $-1\leq \Cor(X,Y)\leq 1$. Equality holds if and only if the expectation above is zero, which happens precisely when the standardized versions of $X$ and $Y$ agree (or are negatives of each other) with probability 1, corresponding to a linear relationship $Y=aX+b$ with $a\neq 0$.
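The bound can also be observed numerically. In the sketch below (made-up data) the sample correlation of noisy data stays strictly inside $[-1,1]$, while an exact linear relationship $Y=aX+b$ gives a correlation of $+1$ or $-1$ according to the sign of $a$.

import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal(100_000)
noise = rng.standard_normal(100_000)

print(np.corrcoef(X, 0.3 * X + noise)[0, 1])    # strictly between -1 and 1
print(np.corrcoef(X, 2.0 * X + 1.0)[0, 1])      # approximately +1 (a > 0)
print(np.corrcoef(X, -2.0 * X + 1.0)[0, 1])     # approximately -1 (a < 0)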