Probability

The Analysis of Data, volume 1

Basic Definitions: Conditional Probability and Independence

1.5. Conditional Probability and Independence

Definition 1.5.1. The conditional probability of an event $A$ given an event $B$ with $\P(B) > 0$ is $\P(A \c B) = \frac{\P(A\cap B)}{\P(B)}.$

If $\P(A)>0$ and $\P(B)>0$ we have $\P(A\cap B)=\P(A \c B)\P(B)=\P(B \c A)\P(A).$

Intuitively, $\P(A \c B)$ is the probability of $A$ occurring assuming that the event $B$ occurred. In accordance with that intuition, the conditional probability has the following properties.

• If $B\subset A$, then $\P(A \c B)=\P(B)/\P(B)=1$.
• If $A\cap B=\emptyset$, then $\P(A \c B)=0/\P(B)=0$.
• If $A\subset B$ then $\P(A \c B)=\P(A)/\P(B)$.
• The conditional probability may be viewed as a probability function $\P_A(E) \defeq \P(E \c A)$ satisfying Definition 1.2.1 (Exercise 7). In addition, all the properties and intuitions that apply to probability functions apply to $\P_A$ as well.
• If the event $A$ is known to have occurred, $\P_A$ generally provides more accurate forecasts than $\P$.
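The properties above can be checked numerically on a small sample space. The sketch below (plain Python; the helpers `P` and `P_cond` are illustrative, not from the text) verifies that $\P_A$ behaves like a probability function for a fair die:

```python
from fractions import Fraction

# A fair six-sided die under the classical model.
omega = set(range(1, 7))

def P(event):
    """Classical probability: |E| / |Omega|."""
    return Fraction(len(event & omega), len(omega))

def P_cond(event, given):
    """P(E | A) = P(E ∩ A) / P(A), defined when P(A) > 0."""
    return P(event & given) / P(given)

A = {2, 4, 6}   # conditioning event: the throw is even
E = {4, 5, 6}

# P_A = P(. | A) behaves like a probability function (Definition 1.2.1):
assert P_cond(omega, A) == 1                                        # normalization
assert P_cond({2}, A) + P_cond({4, 6}, A) == P_cond({2, 4, 6}, A)   # additivity
assert P_cond(E, A) == Fraction(2, 3)   # P(E ∩ A)/P(A) = (1/3)/(1/2)
```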

As mentioned above, conditional probabilities are usually intuitive. The following example from (Feller, 1968), however, shows a counter-intuitive situation involving conditional probabilities. This demonstrates that intuition should not be a substitute for rigorous computation.

Example 1.5.1. Consider families with two children where the gender probability of each child is symmetric (1/2). We select a family at random and consider the sample space describing the gender of the children $\Omega=\{MM,MF,FM,FF\}$. We assume a classical model, implying that the probabilities of all 4 elementary events are 1/4.

We define the event that both children in the family are boys as $A=\{MM\}$, the event that a family has a boy as $B=\{MF,FM,MM\}$, and the event that the first child is a boy as $C=\{MF,MM\}$.

Given that the first child is a boy, the probability that both children are boys is $\P(A \c C)=\P(A\cap C)/\P(C)=\P(A)/\P(C)=(1/4)/(1/2)=1/2.$ This matches our intuition. Given that the family has a boy, the probability that both children are boys is the counterintuitive $\P(A \c B)=\P(A\cap B)/\P(B)=\P(A)/\P(B)=(1/4)/(3/4)=1/3.$
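Both conditional probabilities in Example 1.5.1 can be verified by enumerating the four elementary events; the sketch below uses a small illustrative helper `P` for classical probabilities:

```python
from fractions import Fraction
from itertools import product

# The four equally likely gender orderings of Example 1.5.1.
omega = set(product("MF", repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {("M", "M")}                          # both children are boys
B = {w for w in omega if "M" in w}        # the family has a boy
C = {w for w in omega if w[0] == "M"}     # the first child is a boy

p_given_first = P(A & C) / P(C)   # P(A | C) = 1/2
p_given_any = P(A & B) / P(B)     # P(A | B) = 1/3
assert p_given_first == Fraction(1, 2)
assert p_given_any == Fraction(1, 3)
```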

Definition 1.5.2. Two events $A,B$ are independent if $\P(A\cap B)=\P(A)\P(B)$. A finite number of events $A_1,\ldots,A_n$ are independent if for every subset of distinct indices $i_1,\ldots,i_k$ we have $\P(A_{i_1}\cap\cdots\cap A_{i_k})=\P(A_{i_1})\cdots \P(A_{i_k})$, and are pairwise independent if every pair $A_i, A_j$, $i\neq j$, is independent.

The following definition generalizes independence to an arbitrary collection of events, indexed by a (potentially infinite) set $\Theta$.

Definition 1.5.3. Multiple events $A_{\theta}, \theta\in\Theta$ are pairwise independent if every pair of events is independent. Multiple events $A_{\theta}, \theta\in\Theta$ are independent if for every $k>0$ and every subset of $k$ distinct events $A_{\theta_1},\ldots,A_{\theta_k}$, we have $\P(A_{\theta_1}\cap\cdots\cap A_{\theta_k})=\P(A_{\theta_1})\cdots \P(A_{\theta_k}).$

Note that pairwise independence is a strictly weaker condition than independence.
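A classic example separating the two notions uses two fair coin tosses (this example is an illustration, not from the text): the events "first toss is heads", "second toss is heads", and "both tosses agree" are pairwise independent but not independent.

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses under the classical model.
omega = set(product("HT", repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}     # first toss is heads
B = {w for w in omega if w[1] == "H"}     # second toss is heads
C = {w for w in omega if w[0] == w[1]}    # both tosses agree

# Every pair is independent ...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ... but the triple is not: P(A∩B∩C) = 1/4 while P(A)P(B)P(C) = 1/8.
assert P(A & B & C) != P(A) * P(B) * P(C)
```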

In agreement with our intuition, conditioning on an event that is independent of $A$ does not modify the probability of $A$: $\P(A \c B)=\frac{\P(A)\P(B)}{\P(B)}=\P(A).$ On the other hand, two disjoint events cannot occur simultaneously and should therefore be dependent. Indeed, in this case $\P(A \c B)=0 \neq \P(A)$ (assuming that $\P(A)$ and $\P(B)$ are non-zero).

Example 1.5.2. We consider a random experiment of throwing two dice independently, and denote by $A$ the event that the first throw resulted in 1, by $B$ the event that the sum of the two throws is 3, and by $C$ the event that the second throw is even. Assuming the classical model, the events $A,B$ are dependent: $\P(A\cap B)=\P(B \c A)\P(A)= (1/6)(1/6) \neq (1/6)(2/36) = \P(A)\P(B).$ On the other hand, $A$ and $C$ are independent: $\P(A\cap C)=\P(C \c A)\P(A)=(1/2)(1/6)=\P(A)\P(C).$
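The two claims in Example 1.5.2 can be confirmed by enumerating the 36 equally likely outcomes; the helper `P` below is illustrative:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two die throws.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 1}       # first throw is 1
B = {w for w in omega if sum(w) == 3}     # sum of the two throws is 3
C = {w for w in omega if w[1] % 2 == 0}   # second throw is even

assert P(A & B) != P(A) * P(B)   # dependent: 1/36 vs 1/108
assert P(A & C) == P(A) * P(C)   # independent: 1/12 = 1/12
```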
Proposition 1.5.1. If $A,B$ are independent, then so are the events $A^c,B$, the events $A,B^c$, and the events $A^c,B^c$.
Proof. For example, \begin{align*} \P(A^c\cap B) &= \P(B\setminus A)=\P(B)-\P(A\cap B)=\P(B)-\P(A)\P(B)\\ &=(1-\P(A))\P(B)=\P(A^c)\P(B). \end{align*} The other parts of the proof are similar.
Proposition 1.5.2 (Bayes' Theorem). If $\P(B)\neq 0$ and $\P(A)\neq 0$, then $\P(A \c B)=\frac{\P(B \c A)\P(A)}{\P(B)}.$
Proof. $\P(A \c B)\P(B)=\P(A\cap B)=\P(B\cap A)=\P(B \c A)\P(A).$
Example 1.5.3. We consider the following imaginary voting pattern of a group of 100 Americans, classified according to their party and whether they live in a city or a small town. The last row and last column capture the sum of the columns and the sum of the rows, respectively.
              City   Small Town   Total
Democrats      30        15         45
Republicans    20        35         55
Total          50        50        100

We consider the experiment of drawing a person at random and observing the vote. The sample space contains 100 elementary events and we assume a classical model, implying that each person may be selected with equal $1/100$ probability.

Defining $A$ as the event that a person selected at random lives in the city, and $B$ as the event that a person selected at random is a Democrat, we have \begin{align*} \P(A\cap B) &=30/100\\ \P(A^c\cap B) &=15/100\\ \P(A\cap B^c) &=20/100\\ \P(A^c\cap B^c)&=35/100\\ \P(A) &=50/100\\ \P(B) &=45/100\\ \P(A \c B) &=0.3/0.45\\ \P(A \c B^c) &=0.2/0.55\\ \P(B \c A) &=0.3/0.5\\ \P(B \c A^c) &=0.15/0.5. \end{align*} Since $A,B$ are dependent, conditioning on city dwelling raises the probability that a randomly drawn person is a Democrat from $\P(B)=0.45$ to $\P(B \c A)=0.6$.
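The probabilities in Example 1.5.3 follow directly from the joint counts in the table, and Bayes' theorem (Proposition 1.5.2) links the two conditional probabilities. A sketch, with an illustrative helper `P` over the 100 people:

```python
from fractions import Fraction

# Joint counts from the 100-person table (residence x party).
counts = {
    ("city", "dem"): 30, ("town", "dem"): 15,
    ("city", "rep"): 20, ("town", "rep"): 35,
}
n = sum(counts.values())

def P(pred):
    """Classical probability of the set of people satisfying pred."""
    return Fraction(sum(c for k, c in counts.items() if pred(k)), n)

A = lambda k: k[0] == "city"      # lives in the city
B = lambda k: k[1] == "dem"       # votes Democrat
AB = lambda k: A(k) and B(k)

P_B_given_A = P(AB) / P(A)        # 0.3/0.5 = 3/5
P_A_given_B = P(AB) / P(B)        # 0.3/0.45 = 2/3
# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A).
assert P_B_given_A == P_A_given_B * P(B) / P(A)
```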

Proposition 1.5.3 (General Multiplication Rule). Assuming $\P(A_1\cap\cdots\cap A_{n-1})>0$ (so that all conditional probabilities below are well defined), \begin{align*} \P(A_1\cap \cdots\cap A_n)=\P(A_1)\P(A_2 \c A_1)\P(A_3 \c A_2\cap A_1)\cdots \P(A_n \c A_1\cap\cdots \cap A_{n-1}). \end{align*}
Proof. Using induction and $\P(A\cap B)=\P(A \c B)\P(B)$, we get \begin{align*} \P(A_1\cap \cdots\cap A_n)&=\P(A_n \c A_1\cap\cdots \cap A_{n-1})\P(A_1\cap\cdots \cap A_{n-1})\\ &=\cdots\\ &=\P(A_1)\P(A_2 \c A_1)\P(A_3 \c A_2\cap A_1)\cdots \P(A_n \c A_1\cap\cdots \cap A_{n-1}). \end{align*}
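A standard application of the general multiplication rule (this example is an illustration, not from the text) is drawing cards without replacement, conditioning each draw on the previous ones:

```python
from fractions import Fraction

# Probability that the first three cards of a shuffled 52-card deck are
# all aces, via P(A1) P(A2 | A1) P(A3 | A1 ∩ A2): after each ace is
# drawn, one fewer ace and one fewer card remain.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)   # 1/5525
```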
Proposition 1.5.4 (The Law of Total Probability). If $A_i, i\in S$, form a finite or countably infinite partition of $\Omega$ (see Section A.1 for a definition of a partition), then $\P(B)=\sum_{i\in S} \P(A_i)\P(B \c A_i).$
Proof. The partition $A_i, i\in S$, of $\Omega$ induces a partition $B\cap A_i$, $i\in S$, of $B$. The result follows from countable additivity (third probability axiom) or finite additivity applied to that partition $\P(B)=\P\left( \bigcup_{i\in S} (B\cap A_i) \right)= \sum_{i\in S} \P(A_i\cap B)=\sum_{i\in S} \P(A_i)\P(B \c A_i).$

The figure below illustrates the above proposition and its proof.

Figure 1.5.1: The partition $A_1,\ldots,A_4$ of $\Omega$ induces a partition $B\cap A_i$, $i=1,\ldots,4$ of $B$ (see Proposition 1.5.4).
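Proposition 1.5.4 can be checked numerically; the sketch below (illustrative, not from the text) partitions the two-dice sample space by the value of the first throw and recovers the probability that the sum is 7:

```python
from fractions import Fraction
from itertools import product

# Sample space of two die throws under the classical model.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

B = {w for w in omega if sum(w) == 7}   # the sum of the two throws is 7
# Partition of Omega by the value of the first throw.
A = {i: {w for w in omega if w[0] == i} for i in range(1, 7)}

# Law of total probability: P(B) = sum_i P(A_i) P(B | A_i).
total = sum(P(A[i]) * P(B & A[i]) / P(A[i]) for i in range(1, 7))
assert total == P(B) == Fraction(1, 6)
```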

The definition below extends the notion of independence to multiple experiments.

Definition 1.5.4. Consider $n$ random experiments with sample spaces $\Omega_1,\ldots, \Omega_n$. The set $\Omega=\Omega_1\times\cdots\times \Omega_n$ (see Chapter A for a definition of the Cartesian product $\times$) is the sample space expressing all possible results of the experiments. The experiments are independent if for all sets $A_1\times\cdots\times A_n$ with $A_i\subset\Omega_i$, $\P(A_1\times\cdots\times A_n) = \P(A_1)\cdots \P(A_n).$

In the equation above, the probability function on the left hand side is defined on $\Omega_1\times\cdots\times \Omega_n$ and the probability functions on the right hand side are defined on $\Omega_i$, $i=1,\ldots,n$.

Example 1.5.4. In two independent die throwing experiments $\Omega=\{1,\ldots,6\}\times\{1,\ldots,6\}$ and \begin{align*} \P(\text{first die is 3, second die is 4})&=\P(\text{first die is 3})\P(\text{second die is 4})\\ &=\frac{1}{6}\cdot\frac{1}{6}=\frac{1}{36}.\end{align*}
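The factorization in Example 1.5.4 can be verified directly on the product sample space; the sketch below is illustrative:

```python
from fractions import Fraction
from itertools import product

# Probability of a single-die event under the classical model.
P1 = lambda A: Fraction(len(A), 6)

A1, A2 = {3}, {4}                 # first die is 3, second die is 4
joint = set(product(A1, A2))      # the product event A1 x A2
p_joint = Fraction(len(joint), 36)

# The probability on the product space factorizes: P(A1 x A2) = P(A1) P(A2).
assert p_joint == P1(A1) * P1(A2) == Fraction(1, 36)
```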

Chapter 4 contains an extended discussion of probabilities associated with multiple experiments.