
3.4. The Hypergeometric Distribution

The hypergeometric RV, $X\sim \text{Hyp}(m,n,k)$, where $m,n,k\in\mathbb{N}\cup\{0\}$ and $k\le m+n$, describes the number of white balls obtained in $k$ draws without replacement from an urn containing $m$ white balls and $n$ green balls. It is similar to a binomial RV in that it counts the number of successes (white balls) in a sequence of experiments, but in the hypergeometric case the probability of success is not constant: it changes from draw to draw, since the composition of the urn changes as balls are removed.
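As a concrete illustration of the urn model, the R sketch below (our own addition, not part of the original text) simulates repeated draws without replacement and tabulates the number of white balls obtained; the empirical frequencies should be close to the pmf derived next.

set.seed(1)
m = 15  # white balls, coded 1
n = 10  # green balls, coded 0
k = 15  # number of draws without replacement
urn = c(rep(1, m), rep(0, n))
# simulate 10000 experiments; each records the number of white balls drawn
draws = replicate(10000, sum(sample(urn, k, replace = FALSE)))
table(draws) / 10000  # empirical frequencies, close to dhyper(x, m, n, k)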

The pmf of $X\sim \text{Hyp}(m,n,k)$ is \begin{align*} p_X(x) &= \begin{cases}\dfrac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} & x\in\{0,1,\ldots,m\}\\ 0 & \text{otherwise}\end{cases} \end{align*} (with the convention that $\binom{a}{b}=0$ whenever $b<0$ or $b>a$). The formula can be derived by noting that the numerator counts the number of possible draws having $x$ white and $k-x$ green balls, and the denominator counts the total number of possible draws of $k$ balls from the $m+n$ balls in the urn. Assuming the classical model (see Section 1.3), the ratio above corresponds to the required probability.
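As a sanity check (ours, not from the text), the pmf formula can be evaluated directly in R with choose and compared against the built-in dhyper function.

m = 15
n = 10
k = 15
x = 0:k
# evaluate the pmf formula directly; choose(a, b) returns 0 when b > a
p_formula = choose(m, x) * choose(n, k - x) / choose(m + n, k)
all.equal(p_formula, dhyper(x, m, n, k))  # TRUE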

The R code below graphs the pmfs of three hypergeometric RVs, all with $m=15$ and $n=10$ but with different numbers of draws $k$.

library(ggplot2)

m = 15  # number of white balls in the urn
n = 10  # number of green balls in the urn
x = 0:18
# pmf values for three different numbers of draws k
D = stack(list(`$k=15$` = dhyper(x, m, n, 15),
               `$k=23$` = dhyper(x, m, n, 23),
               `$k=25$` = dhyper(x, m, n, 25)))
names(D) = c("mass", "k")
D$x = x
qplot(x, mass, data = D, geom = "point", facets = k ~ .,
      xlab = "$x$", ylab = "$p_X(x)$",
      main = "Hypergeometric pmf ($m=15, n=10$)") +
    geom_linerange(aes(x = x, ymin = 0, ymax = mass))

If the number of balls $m+n$ is much larger than the number of draws $k$, the hypergeometric distribution is closely approximated by the binomial distribution with $k$ trials and success probability $m/(m+n)$ (sampling without replacement behaves like sampling with replacement when only a small fraction of the balls is drawn). This is illustrated by the first row of the figure above, whose shape resembles that of the corresponding binomial pmf. The second and third rows, where $k$ is close to $m+n$, deviate strongly from the binomial model.
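The approximation can also be checked numerically. The following sketch (the parameter values are ours, chosen for illustration) compares dhyper for a much larger urn with the same proportion of white balls to the binomial pmf dbinom with success probability $m/(m+n)$.

m = 1500  # larger urn with the same proportion of white balls (0.6)
n = 1000
k = 15
x = 0:k
# maximum pointwise difference between the hypergeometric and binomial pmfs;
# it is small since k is a small fraction of m + n
max(abs(dhyper(x, m, n, k) - dbinom(x, k, m / (m + n))))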