## Probability

### The Analysis of Data, volume 1


## F.6. Multivariate Differentiation and Integration

Many of the definitions above generalize to functions $f:\R^n\to\R$ and even functions $\bb f:\R^n\to\R^m$. We denote the latter in bold-face to emphasize that $\bb f(\bb x)$ is a vector. Its components are denoted by subscript functions: $\bb f(\bb x) = (f_1(\bb x),\ldots,f_m(\bb x))$, where $f_i:\R^n\to \R$, $i=1,\ldots,m$.
Definition F.6.1. The partial derivative $\partial f(\bb x)/\partial x_j$ of a function $f:\R^n\to\R$ is $\frac{\partial f(\bb x)}{\partial x_j} =\lim_{t\to x_j} \frac{f(x_1,\ldots,x_{j-1},t,x_{j+1},\ldots,x_n)-f(\bb x)}{t-x_j}.$

In other words, the partial derivative $\partial f(\bb x)/\partial x_j$ is the ordinary derivative $dg(x)/dx$, evaluated at $x=x_j$, of the function $g$ obtained from $f$ by treating the $j$-th component as the variable and fixing the remaining components of $\bb x$ as constants.
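The limit in the definition above can be sanity-checked numerically with a central difference quotient, which perturbs only the $j$-th coordinate while holding the others fixed. The following is a minimal sketch; the helper name `partial_derivative` is illustrative, not from the text.

```python
def partial_derivative(f, x, j, h=1e-6):
    """Central finite-difference estimate of the partial derivative of f
    with respect to x_j, holding the remaining components of x fixed."""
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    return (f(xp) - f(xm)) / (2 * h)

# f(x) = x_1^2 * x_2 has partial derivatives 2*x_1*x_2 and x_1^2.
f = lambda x: x[0] ** 2 * x[1]
print(partial_derivative(f, [2.0, 3.0], 0))  # close to 2*2*3 = 12
print(partial_derivative(f, [2.0, 3.0], 1))  # close to 2^2 = 4
```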

Definition F.6.2. We define the following generalizations of first and second order derivatives.
1. For a function $\bb f:\R^n\to\R^m$, we define its derivative matrix $\nabla {\bb f}\in\R^{m\times n}$ by $[\nabla {\bb f}]_{ij}=\partial f_i(\bb x)/\partial x_j$. This matrix is sometimes called the Jacobian matrix and is often denoted by the letter $J$. If $m=1$, we call the resulting $1\times n$ matrix the gradient vector.
2. For a function $f:\R^{n}\to \R$, we define its directional derivative along a vector $\bb v\in\R^n$ as $D_{\bb v}f(\bb x)={\bb v}^{\top}\nabla f(\bb x)$.
3. For a function $f:\R^n\to \R$, we define the second derivative or Hessian matrix $\nabla^2 f\in\R^{n\times n}$ as the matrix of second order partial derivatives $[\nabla^2 f]_{ij}= \frac{\partial^2 f(\bb x)}{\partial x_i\partial x_j} \defeq \frac{\partial}{\partial x_j} \frac{\partial f(\bb x)}{\partial x_i}.$
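The three objects above can be approximated numerically with finite differences: the gradient is a vector of partial derivatives, the directional derivative is an inner product with the gradient, and each Hessian row is the gradient of a partial derivative. A minimal sketch, with illustrative function names:

```python
def gradient(f, x, h=1e-6):
    """Finite-difference gradient of f: R^n -> R, returned as a list of n partials."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def directional_derivative(f, x, v):
    """D_v f(x) = v^T grad f(x)."""
    return sum(vi * gi for vi, gi in zip(v, gradient(f, x)))

def hessian(f, x, h=1e-4):
    """Finite-difference Hessian: row i is the gradient of df/dx_i."""
    def partial_i(i):
        return lambda y: gradient(f, y, h)[i]
    return [gradient(partial_i(i), x, h) for i in range(len(x))]

# f(x) = x_1^2 + 3 x_1 x_2: gradient (2x_1 + 3x_2, 3x_1), Hessian [[2, 3], [3, 0]].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
print(gradient(f, [1.0, 2.0]))                            # close to [8, 3]
print(directional_derivative(f, [1.0, 2.0], [1.0, 1.0]))  # close to 11
print(hessian(f, [1.0, 2.0]))
```

Note that the Hessian comes out symmetric here, as expected when the mixed second order partials are continuous.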
Proposition F.6.1. For $\bb f:\R^n\to\R^m$, $\bb g:\R^m\to\R^s$, we have
1. $\nabla({\bb g(\bb f(\bb x))}) = (\nabla{\bb g})(\bb f(\bb x)) \nabla \bb f(\bb x),$
2. $\nabla ({\bb f(\bb x)^{\top}\bb g(\bb x)}) = \bb g(\bb x)^{\top}\nabla \bb f(\bb x)+\bb f(\bb x)^{\top}\nabla{\bb g}(\bb x)$, and
3. if $\nabla \bb f(\bb x)$ is continuous over $\bb x\in B_{r}(\bb y)$, then for every $\bb t\in B_{r}(\bb 0)$, $\bb f(\bb y+\bb t) = \bb f (\bb y)+\left(\int_0^1 \nabla \bb f(\bb y+z\bb t)\, dz\right)\bb t.$

The third part of the proposition above is called the multivariate mean value theorem. The integral in it represents a matrix whose entries are the integrals of the corresponding argument, and thus the second term on the right hand side of the third statement is a product of a matrix and a (column) vector. The entire equation is therefore a vector equation with vectors on both sides of the equality symbol.

Proof. Statements 1 and 2 are restatements of the chain rule and the product rule for partial derivatives in matrix form. To prove statement 3, consider the following vector equation (integrals over vectors are interpreted as vectors containing the corresponding component-wise integrals) \begin{align*} \int_0^1 (\nabla \bb f(\bb y + z\bb t))\, \bb t \, dz &= \int_0^1 \frac{d\bb f(\bb y + z\bb t)}{dz} \, dz = \bb f(\bb y+\bb t)- \bb f(\bb y), \end{align*} where the first equality follows from the chain rule (statement 1 above) and the second equality follows from the vector form of the fundamental theorem of calculus (Proposition F.2.2).
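The integral identity in statement 3 can be checked numerically for the $m=1$ case: the left hand side below is $f(\bb y+\bb t)-f(\bb y)$, and the right hand side approximates $\left(\int_0^1 \nabla f(\bb y+z\bb t)\,dz\right)\bb t$ with a midpoint rule. Function names and tolerances are illustrative.

```python
import math

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of f: R^n -> R."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def mvt_rhs(f, y, t, steps=500):
    """Midpoint-rule estimate of (integral_0^1 grad f(y + z t) dz) t."""
    acc = [0.0] * len(y)
    for k in range(steps):
        z = (k + 0.5) / steps
        g = grad(f, [yi + z * ti for yi, ti in zip(y, t)])
        acc = [a + gi / steps for a, gi in zip(acc, g)]
    return sum(ai * ti for ai, ti in zip(acc, t))

# Compare f(y + t) - f(y) with the integral expression on a smooth test function.
f = lambda x: math.exp(x[0]) * x[1] + x[1] ** 3
y, t = [0.2, 0.5], [0.3, -0.1]
lhs = f([yi + ti for yi, ti in zip(y, t)]) - f(y)
print(lhs, mvt_rhs(f, y, t))  # the two numbers agree closely
```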

The multivariate mean value theorem (part 3 of the proposition above) forms the most basic multivariate Taylor series approximation. Specifically, it implies that if the derivatives are bounded, we have the following approximation of $f$ using a constant function: \begin{align*} f(\bb y+\bb t) = f(\bb y) + o(1), \qquad \text{as}\qquad \bb t\to \bb 0. \end{align*} Applying the multivariate mean value theorem twice, we get the proposition below, which implies the following linear approximation of $f$: \begin{align*} f(\bb y+\bb t) = f(\bb y) + \nabla f(\bb y) \bb t + o(\|\bb t\|) \approx f(\bb y) + \nabla f(\bb y) \bb t, \qquad \text{as}\qquad \bb t\to \bb 0. \end{align*}
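The $o(\|\bb t\|)$ error of the linear approximation can be observed numerically: for a twice differentiable test function the error in fact shrinks like $\|\bb t\|^2$, so shrinking $\|\bb t\|$ tenfold shrinks the error roughly a hundredfold. A small sketch, using a test function whose gradient is known in closed form:

```python
import math

# f(x) = sin(x_1) + x_1 x_2^2, with its exact gradient used for the linear term.
f = lambda x: math.sin(x[0]) + x[0] * x[1] ** 2
grad_f = lambda x: [math.cos(x[0]) + x[1] ** 2, 2 * x[0] * x[1]]

y = [0.5, 1.0]
errors = []
for s in [1e-1, 1e-2, 1e-3]:
    t = [s, s]
    linear = f(y) + sum(g * ti for g, ti in zip(grad_f(y), t))
    errors.append(abs(f([yi + ti for yi, ti in zip(y, t)]) - linear))
# Each tenfold shrink of ||t|| shrinks the error roughly a hundredfold,
# i.e. the error is O(||t||^2) = o(||t||).
print(errors)
```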

Proposition F.6.2 (Multivariate Taylor Series Theorem (second order)). Let $f:\R^n\to \R$ be a function whose second order partial derivatives exist and are continuous in $B_{r}(\bb y)$. Then for all $\bb t\in B_{r}(\bb 0)$, \begin{align} f(\bb y+\bb t) &= f(\bb y) + \nabla f(\bb y)\bb t + {\bb t}^{\top} \int_0^1\int_0^1 z \nabla^2 f(\bb y + zw \bb t)\, dz dw\, \bb t. \end{align}
Proof. Applying the multivariate mean value theorem twice (first to $\nabla f$ over the displacement $z\bb t$, then to $f$ itself), we have \begin{align*} \nabla f(\bb y+z\bb t) &= \nabla f(\bb y)+ z{\bb t}^{\top}\int_0^1 \nabla^2 f(\bb y+zw\bb t)\, dw\\ f(\bb y+\bb t) &= f (\bb y)+\int_0^1 \nabla f(\bb y+z\bb t)\, dz \,\,\bb t\\ &= f (\bb y)+\int_0^1 \nabla f(\bb y)\, dz \,\,\bb t + {\bb t}^{\top} \int_0^1\int_0^1\nabla^2 f(\bb y+zw\bb t)\, z\, dzdw\,\,\bb t\\ &= f (\bb y)+\nabla f(\bb y) \bb t + {\bb t}^{\top} \int_0^1\int_0^1\nabla^2 f(\bb y+zw\bb t)\, z\, dzdw\,\,\bb t. \end{align*}

The Taylor series above is given with an integral remainder. The remainder may also be expressed in differential (Lagrange) form: \begin{align*} f(\bb x) &= f(\bb \alpha) + (\nabla f(\bb \alpha))^{\top} (\bb x-\bb \alpha) +\frac{1}{2}(\bb x-\bb \alpha)^{\top} \nabla^2 f(\bb \beta) (\bb x-\bb \alpha) \end{align*} for some $\bb\beta$ on the line segment connecting $\bb x$ and $\bb\alpha$.
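The quality of the second order expansion can also be observed numerically: with $\bb\beta=\bb\alpha$ in the remainder term, the residual shrinks like $\|\bb x-\bb\alpha\|^3$, one order faster than the linear approximation. A sketch under illustrative names, using a test function whose gradient and Hessian are known in closed form:

```python
import math

# f(x) = exp(x_1 + 2 x_2); gradient and Hessian at alpha are known exactly.
f = lambda x: math.exp(x[0] + 2 * x[1])

def taylor2(alpha, t):
    """Second-order Taylor expansion of f around alpha, evaluated at alpha + t."""
    e = math.exp(alpha[0] + 2 * alpha[1])
    grad = [e, 2 * e]
    hess = [[e, 2 * e], [2 * e, 4 * e]]
    linear = sum(g * ti for g, ti in zip(grad, t))
    quad = sum(t[i] * hess[i][j] * t[j] for i in range(2) for j in range(2))
    return e + linear + 0.5 * quad

alpha = [0.1, 0.2]
errors = []
for s in [1e-1, 1e-2, 1e-3]:
    t = [s, s]
    x = [a + ti for a, ti in zip(alpha, t)]
    errors.append(abs(f(x) - taylor2(alpha, t)))
# The error now shrinks roughly a thousandfold per tenfold shrink of ||t||,
# as expected for a second order expansion of a smooth function.
print(errors)
```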

Proposition F.6.3. (Multivariate Change of Variables). Let $\bb f:\R^k\to\R^k$ be a one-to-one differentiable function whose Jacobian matrix $\nabla\bb f$ is non-singular in the domain of $\bb f$, and let $g$ be a continuous function. Then \begin{align*} \int_{\bb f(A)} g(\bb x)\, d\bb x = \int_A g(\bb f(\bb y))\, |\det \nabla\bb f(\bb y)| \, d\bb y. \end{align*}
A proof of Proposition F.6.3 appears, for example, in (Rudin, 1976, Chapter 10). We describe below two examples that use the multivariate change of variables method to solve a multivariate integral. Both examples have applications in probability theory and are used elsewhere in the book. A third example is available in the proof of Proposition 5.3.1.
Example F.6.1. The transformation from the Cartesian coordinates $(x,y)$ to polar coordinates $(r,\theta)$ is \begin{align*} r &= \sqrt{x^2+y^2}\\ \theta &= \tan^{-1}(y/x) \end{align*} and the inverse transformation $(r,\theta)\mapsto (x,y)$ is given by \begin{align*} x&=r\cos \theta\\ y&=r\sin\theta. \end{align*} The space $\R^2$ is realized by either $(x,y)\in\R^2$ or by $(r,\theta) \in [0,\infty)\times [0,2\pi)$. The Jacobian of the mapping $(r,\theta) \mapsto (x,y)$ is $J= \begin{pmatrix} \cos\theta & -r\sin\theta\\ \sin\theta&r\cos\theta \end{pmatrix} \qquad \det J=r(\cos^2\theta+\sin^2\theta)=r\cdot 1 = r.$ An important application of the polar coordinates change of variable is the following calculation of the Gaussian integral \begin{align}\nonumber \left(\int_{\mathbb{R}} e^{-x^2/2} dx\right)^2 &= \left(\int_{\mathbb{R}} e^{-x^2/2} dx\right) \left(\int_{\mathbb{R}} e^{-y^2/2} dy\right)= \iint_{\mathbb{R}^2} e^{-(x^2+y^2)/2} dxdy \\ &= \int_0^{2\pi}\int_0^{\infty}e^{-r^2/2} r\, drd\theta = 2\pi \int_0^{\infty}e^{-r^2/2} r\, dr = 2\pi\cdot 1 \end{align} where the last equality follows from Example F.2.4. We therefore have \begin{align} \label{eq:int:GausInt} \int_{\mathbb{R}} e^{-x^2/2} dx = \sqrt{2\pi}, \end{align} showing that the Gaussian pdf from Section 3.9 integrates to one.
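The value $\sqrt{2\pi}$ obtained above can be confirmed numerically with a simple midpoint rule, truncating the domain at $\pm 10$ where the integrand $e^{-x^2/2}$ is negligibly small. The function name and truncation bounds are illustrative choices.

```python
import math

def gaussian_integral(a=-10.0, b=10.0, steps=100_000):
    """Midpoint-rule estimate of the integral of exp(-x^2/2) over [a, b]."""
    h = (b - a) / steps
    return h * sum(math.exp(-((a + (k + 0.5) * h) ** 2) / 2) for k in range(steps))

print(gaussian_integral())     # close to 2.5066...
print(math.sqrt(2 * math.pi))  # sqrt(2*pi) = 2.5066...
```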
Example F.6.2. The Beta function is defined as follows \begin{align*} B(\alpha,\beta) \defeq \int_0^1 x^{\alpha-1}(1-x)^{\beta-1} \,dx, \quad \alpha,\beta>0. \end{align*} Recalling the definition of the Gamma function $\Gamma(x)$ in Definition 3.10.1, we have \begin{align*} \Gamma(\alpha)\Gamma(\beta) &= \int_0^{\infty}\int_0^{\infty} u^{\alpha-1}v^{\beta-1}e^{-u-v}\, dudv\\ &= \int_0^{\infty}\int_0^1 (zt)^{\alpha-1}(z(1-t))^{\beta-1}e^{-z}\, zdtdz\\ &= \int_0^{\infty}e^{-z}z^{\alpha+\beta-1} \,dz \int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\, dt\\ &= \Gamma(\alpha+\beta)B(\alpha,\beta), \end{align*} where we used the variable transformation $(z,t)\mapsto (u,v)$ expressed by $u=zt$, $v=z(1-t)$, with inverse $z=u+v$, $t=u/(u+v)$, whose Jacobian determinant satisfies $\Bigg|\det \begin{pmatrix} t & z \\ 1-t & -z \end{pmatrix}\Bigg| = |-tz -z +tz|=z.$ The derivation above implies the following relationship between the Beta and Gamma functions \begin{align*} B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}. \end{align*}
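The relationship $B(\alpha,\beta)=\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ can be verified numerically by comparing a quadrature estimate of the Beta integral with the Gamma-function expression (Python's `math.gamma`). The function names and the sample values $\alpha=2.5$, $\beta=3$ are illustrative.

```python
import math

def beta_numeric(a, b, steps=200_000):
    """Midpoint-rule estimate of B(a,b) = int_0^1 x^(a-1) (1-x)^(b-1) dx."""
    h = 1.0 / steps
    return h * sum(((k + 0.5) * h) ** (a - 1) * (1 - (k + 0.5) * h) ** (b - 1)
                   for k in range(steps))

def beta_via_gamma(a, b):
    """B(a,b) computed from the Gamma function identity derived above."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

print(beta_numeric(2.5, 3.0))    # quadrature estimate of B(2.5, 3)
print(beta_via_gamma(2.5, 3.0))  # Gamma(2.5) Gamma(3) / Gamma(5.5)
```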