Probability Note
Date: 2024/05/26
Last Updated: 2024-08-16T15:42:43.994Z
Categories: Probability
Tags: Probability
Read Time: 16 minutes
A random variable is a function that maps the outcomes $\Omega$ of a random process to numerical values in $\mathbb{R}$. If $\Omega$ is discrete, the random variable is called a discrete random variable; if $\Omega$ is continuous, it is called a continuous random variable.
The cumulative distribution function (CDF) of a random variable $X$ is defined as

$$F(x) = P(X \leq x)$$

The probability mass function (PMF) of a discrete random variable $X$ is defined as

$$p(x) = P(X = x)$$

The probability density function (PDF) of a continuous random variable $X$ is defined as

$$f(x) = \frac{dF(x)}{dx}$$

where $F(x)$ is the CDF of $X$.

The expectation of a random variable $X$ is defined as

$$E[X] = \sum_{x} x p(x) \quad \text{for discrete random variables}$$

where $p(x)$ is the PMF of $X$, and

$$E[X] = \int_{-\infty}^{\infty} x f(x) \, dx \quad \text{for continuous random variables}$$

where $f(x)$ is the PDF of $X$.

The variance of a random variable $X$ is defined as

$$\text{Var}(X) = E[(X - E[X])^2]$$

By this definition, the variance is always non-negative.

Alternatively, the variance can be calculated as

$$\text{Var}(X) = E[X^2] - E[X]^2$$
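As a quick numerical check of the two variance formulas (a minimal sketch, not part of the original note; it assumes NumPy is available and uses an arbitrary exponential sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # any distribution works here

# Var(X) = E[(X - E[X])^2]
var_def = np.mean((x - x.mean()) ** 2)
# Var(X) = E[X^2] - E[X]^2
var_alt = np.mean(x ** 2) - x.mean() ** 2

print(var_def, var_alt)   # both close to 4, the true variance of an Exponential with scale 2
```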
$$E[aX + b] = aE[X] + b$$

$$E[aX + bY] = aE[X] + bE[Y]$$

$$\text{Var}(aX + b) = a^2 \text{Var}(X)$$

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$

where $a$ and $b$ are constants, and $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
The n-th moment of a random variable $X$ is defined as

$$E[X^n] = \sum_{x} x^n p(x) \quad \text{for discrete random variables}$$

The standard deviation of a random variable $X$ is defined as

$$\text{SD}(X) = \sqrt{\text{Var}(X)}$$
The Bernoulli distribution is a discrete distribution with two possible outcomes: 0 and 1. The PMF of a Bernoulli random variable $X$ is defined as

$$p(x) = \begin{cases}
p & \text{if } x = 1 \\
1 - p & \text{if } x = 0
\end{cases}$$

where $p$ is the probability of success.

The expectation and variance of a Bernoulli random variable $X$ are

$$E[X] = p \quad \text{and} \quad \text{Var}(X) = p(1 - p)$$
The Binomial distribution is a discrete distribution that models the number of successes in a fixed number of independent Bernoulli trials. The PMF of a Binomial random variable $X$ is defined as

$$p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

where $n$ is the number of trials, $x$ is the number of successes, and $p$ is the probability of success.

The expectation and variance of a Binomial random variable $X$ are

$$E[X] = np \quad \text{and} \quad \text{Var}(X) = np(1 - p)$$
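The fact that a Binomial count is a sum of independent Bernoulli trials, with mean $np$ and variance $np(1-p)$, can be checked by simulation (a sketch assuming NumPy; the values of $n$ and $p$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 20, 0.3, 200_000

# A Binomial(n, p) draw is the sum of n independent Bernoulli(p) trials.
bernoulli_trials = rng.random((reps, n)) < p
x = bernoulli_trials.sum(axis=1)

print(x.mean(), n * p)            # ~6.0
print(x.var(), n * p * (1 - p))   # ~4.2
```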
The Poisson distribution is a discrete distribution that models the number of events occurring in a fixed interval of time or space. The PMF of a Poisson random variable $X$ is defined as

$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

where $\lambda$ is the average rate of events.

The expectation and variance of a Poisson random variable $X$ are

$$E[X] = \lambda \quad \text{and} \quad \text{Var}(X) = \lambda$$
The Geometric distribution is a discrete distribution that models the number of trials needed to achieve the first success in a sequence of independent Bernoulli trials. The PMF of a Geometric random variable $X$ is defined as

$$p(x) = (1 - p)^{x - 1} p$$

where $x$ is the number of trials needed to achieve the first success, and $p$ is the probability of success.

The expectation and variance of a Geometric random variable $X$ are

$$E[X] = \frac{1}{p} \quad \text{and} \quad \text{Var}(X) = \frac{1 - p}{p^2}$$
The Uniform distribution is a continuous distribution with a constant probability density function (PDF) over a fixed interval. The PDF of a Uniform random variable $X$ is defined as

$$f(x) = \begin{cases}
\frac{1}{b - a} & \text{if } a \leq x \leq b \\
0 & \text{otherwise}
\end{cases}$$

where $a$ and $b$ are the lower and upper bounds of the interval.

The expectation and variance of a Uniform random variable $X$ are

$$E[X] = \frac{a + b}{2} \quad \text{and} \quad \text{Var}(X) = \frac{(b - a)^2}{12}$$
The Exponential distribution is a continuous distribution that models the time between events in a Poisson process. The PDF of an Exponential random variable $X$ is defined as

$$f(x) = \lambda e^{-\lambda x}$$

where $\lambda$ is the rate parameter.

The expectation and variance of an Exponential random variable $X$ are

$$E[X] = \frac{1}{\lambda} \quad \text{and} \quad \text{Var}(X) = \frac{1}{\lambda^2}$$
The Normal distribution is a continuous distribution that is symmetric and bell-shaped. The PDF of a Normal random variable $X$ is defined as

$$f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

The expectation and variance of a Normal random variable $X$ are

$$E[X] = \mu \quad \text{and} \quad \text{Var}(X) = \sigma^2$$

We usually write $X \sim N(\mu, \sigma^2)$ to denote that $X$ follows a Normal distribution with mean $\mu$ and variance $\sigma^2$.
The Gamma distribution is a continuous distribution that generalizes the Exponential distribution. The PDF of a Gamma random variable $X$ is defined as

$$f(x) = \frac{\lambda^k x^{k - 1} e^{-\lambda x}}{\Gamma(k)}$$

where $\lambda$ is the rate parameter, $k$ is the shape parameter, and $\Gamma(k)$ is the gamma function:

$$\Gamma(k) = \int_{0}^{\infty} x^{k - 1} e^{-x} \, dx$$

The expectation and variance of a Gamma random variable $X$ are

$$E[X] = \frac{k}{\lambda} \quad \text{and} \quad \text{Var}(X) = \frac{k}{\lambda^2}$$
The Beta distribution is a continuous distribution defined on the interval $[0, 1]$. The PDF of a Beta random variable $X$ is defined as

$$f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$

where $\alpha$ and $\beta$ are the shape parameters, and $B(\alpha, \beta)$ is the beta function:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

The expectation and variance of a Beta random variable $X$ are

$$E[X] = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad \text{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
A bivariate distribution is a probability distribution that describes the joint behaviour of two random variables.

Given two random variables $X$ and $Y$, the joint probability mass function (PMF) for discrete random variables is defined as

$$p(x, y) = P(X = x, Y = y)$$

The marginal probability mass function (PMF) of a random variable $X$ is defined as

$$p_X(x) = \sum_{y} p(x, y)$$

The conditional probability mass function (PMF) of a random variable $X$ given $Y = y$ is defined as

$$p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)}$$

The expectation of a function $g(X, Y)$ under a bivariate distribution is defined as

$$E[g(X, Y)] = \sum_{x} \sum_{y} g(x, y) p(x, y)$$
The covariance of two random variables $X$ and $Y$ is defined as

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Covariance can be negative, zero, or positive, and can be calculated as

$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$$

The correlation coefficient of two random variables $X$ and $Y$ is defined as

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}$$

It can be shown that $-1 \leq \rho(X, Y) \leq 1$.
Two random variables $X$ and $Y$ are independent if and only if

$$p(x, y) = p_X(x) p_Y(y)$$

for all $x$ and $y$.

If $X$ and $Y$ are independent, then

$$E[XY] = E[X]E[Y] \quad \text{and} \quad \text{Cov}(X, Y) = 0$$

Two random variables $X$ and $Y$ are uncorrelated if and only if

$$\text{Cov}(X, Y) = 0$$

If $X$ and $Y$ are uncorrelated, then

$$E[XY] = E[X]E[Y]$$

Note: Uncorrelated random variables are not necessarily independent, as the simulation below illustrates.
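A standard illustration of this note is $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$: the covariance is zero, yet the variables are clearly dependent. A small simulation sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2                     # Y is a deterministic function of X, so X and Y are not independent

print(np.cov(x, y)[0, 1])      # ~0: X and Y are uncorrelated
# P(Y > 0.25) is about 0.5, but P(Y > 0.25 | |X| > 0.5) is 1, so X and Y are dependent.
print(np.mean(y > 0.25), np.mean(y[np.abs(x) > 0.5] > 0.25))
```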
Given two random variables $X$ and $Y$, the joint cumulative distribution function (CDF) for continuous random variables is defined as

$$F_{X,Y}(x, y) = P(X \leq x, Y \leq y)$$

The marginal cumulative distribution function (CDF) of a random variable $X$ is defined as

$$F_X(x) = P(X \leq x) = P(X \leq x, Y < \infty) = F_{X,Y}(x, \infty)$$

Given two random variables $X$ and $Y$, if there exists a function $f(x, y)$ such that

$$P((X, Y) \in A) = \iint_{A} f(x, y) \, dx \, dy$$

for all Lebesgue-measurable sets $A$, then $f(x, y)$ is the joint probability density function (PDF) of $X$ and $Y$, and $X$ and $Y$ are called jointly continuous random variables.

By the definition of the joint PDF, we have

$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v) \, dv \, du$$

and

$$f(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \partial y}$$

The marginal probability density function (PDF) of a random variable $X$ is defined as

$$f_X(x) = \frac{dF_X(x)}{dx} = \int_{-\infty}^{\infty} f(x, y) \, dy$$
Given $n$ random variables $X_1, X_2, \ldots, X_n$, let the vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. The joint cumulative distribution function (CDF) is defined as

$$F_\mathbf{X}(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x})$$

If there exists a function $f(x_1, x_2, \ldots, x_n)$ such that

$$P(\mathbf{X} \in A) = \int_{A} f(\mathbf{x}) \, d\mathbf{x}$$

for all Lebesgue-measurable sets $A$, then $f(\mathbf{x})$ is the joint probability density function (PDF) of $\mathbf{X}$, and $X_1, X_2, \ldots, X_n$ are called jointly continuous random variables.

By the definition of the joint PDF, we have

$$F_\mathbf{X}(\mathbf{x}) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \ldots \int_{-\infty}^{x_n} f(u_1, u_2, \ldots, u_n) \, du_n \ldots du_2 \, du_1$$

and

$$f(\mathbf{x}) = \frac{\partial^n F_\mathbf{X}(\mathbf{x})}{\partial x_1 \partial x_2 \ldots \partial x_n}$$
The marginal probability density function (PDF) of a sub-vector $(X_{k_1}, X_{k_2}, \ldots, X_{k_m})$ is obtained by integrating the joint PDF over the remaining variables:

$$f_{X_{k_1}, X_{k_2}, \ldots, X_{k_m}}(x_{k_1}, x_{k_2}, \ldots, x_{k_m}) =
\int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)
\prod_{j \neq k_1, \ldots, k_m} dx_j$$
Two random variables $X$ and $Y$ are independent if and only if

$$F(x, y) = F_X(x) F_Y(y)$$

for all $x$ and $y$.

This can be thought of as saying that the joint behaviour of $X$ and $Y$ is the product of the marginal behaviours of $X$ and $Y$.

The definition can also be formulated in terms of the joint PDF:

$$f(x, y) = f_X(x) f_Y(y)$$

To show that two random variables are not independent, we only need to find one pair of $x$ and $y$ for which the equation does not hold.

Given two independent random variables $X$ and $Y$, and functions $g(X)$ and $h(Y)$, the random variables $Z = g(X)$ and $W = h(Y)$ are also independent.
A set of random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if

$$F(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \ldots F_{X_n}(x_n)$$

Note: Mutual independence implies pairwise independence; however, the converse is not true.

A set of random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed (IID) if and only if:

- They are mutually independent.
- They have the same distribution.
Given two independent random variables $X$ and $Y$, define their sum as

$$Z = X + Y$$

Then the CDF of $Z$ can be calculated as

$$\begin{align}
F_Z(z) &= P(Z \leq z) \\
&= P(X + Y \leq z) \\
&= \int_{-\infty}^{\infty} P(X + Y \leq z \mid X = x) f_X(x) \, dx \\
&= \int_{-\infty}^{\infty} P(Y \leq z - x) f_X(x) \, dx \\
&= \int_{-\infty}^{\infty} F_Y(z - x) f_X(x) \, dx
\end{align}$$

The PDF of $Z$ can be calculated as

$$f_Z(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x) \, dx$$

This is called the convolution of the PDFs of $X$ and $Y$.
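The convolution formula can be checked numerically. For two independent $\text{Uniform}(0, 1)$ variables it gives the triangular density $f_Z(z) = \min(z, 2 - z)$ on $[0, 2]$; the sketch below (assuming NumPy, with an arbitrary grid spacing) approximates the integral with a discrete convolution:

```python
import numpy as np

# Sum of two independent Uniform(0, 1) variables: f_Z(z) = min(z, 2 - z) on [0, 2].
dx = 0.001
grid = np.arange(0.0, 2.0, dx)
f = np.where(grid < 1.0, 1.0, 0.0)        # common PDF of X and Y sampled on the grid

f_z = np.convolve(f, f) * dx              # discrete approximation of the convolution integral

for z in (0.5, 1.0, 1.5):
    print(z, f_z[int(z / dx)], min(z, 2 - z))   # numerical vs exact triangular density
```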
Let the vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ be a set of random variables. For a Lebesgue-measurable function $g: \mathbb{R}^n \rightarrow \mathbb{R}$, the expectation of $g(\mathbf{X})$ is defined as

$$E[g(\mathbf{X})] = \int_{\mathbb{R}^n} g(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x}$$

It satisfies

$$E[a g(\mathbf{X}) + b h(\mathbf{X}) + c] = a E[g(\mathbf{X})] + b E[h(\mathbf{X})] + c$$

If $X$ and $Y$ are independent, then $E[g(X)h(Y)] = E[g(X)]E[h(Y)]$ for any functions $g$ and $h$.
The covariance of two random variables $X$ and $Y$ is defined as

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

The covariance can be calculated as

$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$$

It satisfies the following properties:

$$\text{Cov}(X, Y) = \text{Cov}(Y, X)$$

$$\text{Cov}(X, X) = \text{Var}(X)$$

$$\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$$

$$\text{Cov}\left(\sum_i a_i X_i, \sum_j b_j Y_j\right) = \sum_{i,j} a_i b_j \, \text{Cov}(X_i, Y_j)$$

$$|\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X) \text{Var}(Y)}$$

Equality holds if and only if $X$ and $Y$ are linearly related.
The correlation coefficient of two random variables $X$ and $Y$ is defined as

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}$$

By the Cauchy-Schwarz inequality, we have $-1 \leq \rho(X, Y) \leq 1$.

The n-th (raw) moment of a random variable $X$ is defined as $E[X^n]$, and the n-th central moment is defined as $E[(X - E[X])^n]$.

The joint (raw) moment of random variables $X$ and $Y$ is defined as $E[X^i Y^j]$, and the joint central moment is defined as $E[(X - E[X])^i (Y - E[Y])^j]$.
Given a set of random variables $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$,

$$E[\mathbf{X}] = (E[X_1], E[X_2], \ldots, E[X_n])^T$$

which is an $n \times 1$ vector.

The covariance matrix of $\mathbf{X}$ is defined as

$$\text{Cov}(\mathbf{X}) = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T]$$

which is an $n \times n$ matrix.
Given two random variables $X$ and $Y$, the conditional probability mass function (PMF) of $X$ given $Y = y$ is defined as

$$p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)}$$

and the conditional cumulative distribution function (CDF) of $X$ given $Y = y$ is defined as

$$F_{X|Y}(x|y) = P(X \leq x \mid Y = y)$$

For continuous random variables, the conditional probability density function (PDF) of $X$ given $Y = y$ is defined as

$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}$$

and the conditional cumulative distribution function (CDF) of $X$ given $Y = y$ is defined as

$$F_{X|Y}(x|y) = P(X \leq x \mid Y = y) = \int_{-\infty}^{x} f_{X|Y}(u|y) \, du$$

Given two random variables $X$ and $Y$, the conditional expectation of $X$ given $Y = y$ is defined as

$$E[X|Y = y] = \sum_{x} x \, p_{X|Y}(x|y) \quad \text{for discrete random variables}$$

$$E[X|Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y) \, dx \quad \text{for continuous random variables}$$

We can also define a function of $Y$ as

$$\psi(y) = E[X|Y = y]$$

Then $\psi(Y)$ is a random variable, and we call it the conditional expectation of $X$ given $Y$.
Given two random variables $X$ and $Y$, the law of iterated expectations states that

$$E[E[X|Y]] = E[X]$$

Proof:

$$\begin{align}
E[E[X|Y]] &= \int_{-\infty}^{\infty} E[X|Y = y] f_Y(y) \, dy \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x f_{X|Y}(x|y) \, dx \right) f_Y(y) \, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y) \, dx \, dy \\
&= E[X]
\end{align}$$
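A simulation sketch of the law of iterated expectations, using a hierarchical model chosen purely for illustration ($Y$ exponential with mean 1 and $X \mid Y = y \sim \text{Poisson}(y)$, so $E[X] = E[E[X|Y]] = E[Y] = 1$; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=200_000)   # E[Y] = 1
x = rng.poisson(lam=y)                         # X | Y = y ~ Poisson(y), so E[X | Y] = Y

print(x.mean())    # ~1.0, matching E[E[X|Y]] = E[Y]
print(y.mean())    # ~1.0
```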
Given a random variable $X$ and an event $A$, the law of total probability states that

$$P(A) = \int_{-\infty}^{\infty} P(A|X = x) f_X(x) \, dx$$

Given IID random variables $X_1, X_2, \ldots$ and a stopping time $N$, which is an integer-valued random variable, then (Wald's identity)

$$E\left[\sum_{i=1}^{N}X_i\right] = E[X]E[N]$$
The conditional expectation satisfies the following properties:

$$E[aX + bY + c \mid Z] = aE[X|Z] + bE[Y|Z] + c$$

If $X \geq 0$, then $E[X|Y] \geq 0$.

If $X$ and $Y$ are independent, then $E[X|Y] = E[X]$.

For any functions $g$ and $h$, $E[g(X)h(Y)|Y] = h(Y)E[g(X)|Y]$.
Given two random variables $X$ and $Y$, the conditional variance of $X$ given $Y$ is defined as

$$\text{Var}(X|Y) = E[(X - E[X|Y])^2 \mid Y]$$

The conditional variance can be calculated as

$$\text{Var}(X|Y) = E[X^2|Y] - E[X|Y]^2$$

Note that the conditional variance is a random variable, namely a function of $Y$.

Given two random variables $X$ and $Y$, the law of total variance states that

$$\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])$$
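The same illustrative hierarchical model used above also checks the law of total variance: with $X \mid Y \sim \text{Poisson}(Y)$ and $Y$ exponential with mean 1, $E[\text{Var}(X|Y)] + \text{Var}(E[X|Y]) = E[Y] + \text{Var}(Y) = 2$. A simulation sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=500_000)
x = rng.poisson(lam=y)          # Var(X | Y) = E[X | Y] = Y for a Poisson(Y) count

# E[Var(X|Y)] + Var(E[X|Y]) = E[Y] + Var(Y) = 1 + 1 = 2
print(x.var())                  # ~2.0
print(y.mean() + y.var())       # ~2.0
```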
Given a random variable $X$ with a PDF $f_X(x)$, the support of $f_X(x)$ is the set of values of $x$ where $f_X(x) > 0$.

Given a random variable $X$ with a PDF $f_X(x)$ and a function $Y = g(X)$, if $g$ is a monotonic function, then the CDF of $Y$ is

$$F_Y(y) = \begin{cases}
F_X(g^{-1}(y)) & \text{if } g \text{ is increasing} \\
1 - F_X(g^{-1}(y)) & \text{if } g \text{ is decreasing}
\end{cases}$$

Then the PDF of $Y$ is

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$

For non-monotonic transformations, we can break the transformation into monotonic parts.
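A sketch of the change-of-variables formula for an increasing transformation, using $Y = e^X$ with $X$ standard normal (so $g^{-1}(y) = \ln y$ and $f_Y(y) = f_X(\ln y)/y$); the comparison points and bin width are arbitrary, and NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(500_000)
y = np.exp(x)                                   # g(x) = e^x is increasing, g^{-1}(y) = ln y

def f_y(y_val):
    # f_Y(y) = f_X(ln y) * |d/dy ln y| = standard normal density at ln y, divided by y
    return np.exp(-0.5 * np.log(y_val) ** 2) / (np.sqrt(2 * np.pi) * y_val)

for point in (0.5, 1.0, 2.0):
    h = 0.05
    empirical = np.mean((y > point - h) & (y < point + h)) / (2 * h)
    print(point, empirical, f_y(point))         # empirical density vs formula
```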
Given random variables $X_1$ and $X_2$ with a joint PDF $f_{X_1, X_2}(x_1, x_2)$, and $(Y_1, Y_2) = T(X_1, X_2)$, where $T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ is a one-to-one transformation, let $H = T^{-1}$.

We define $J_H$, the Jacobian determinant of $H$, as

$$J_{H} = \left| \frac{\partial (H_1, H_2)}{\partial (y_1, y_2)} \right|
= \det\begin{bmatrix}
\frac{\partial H_1}{\partial y_1} & \frac{\partial H_1}{\partial y_2} \\
\frac{\partial H_2}{\partial y_1} & \frac{\partial H_2}{\partial y_2}
\end{bmatrix}$$

Then the joint PDF of $(Y_1, Y_2)$ is

$$f_{Y_1, Y_2}(y_1, y_2) = f_{X_1, X_2}(H_1(y_1, y_2), H_2(y_1, y_2)) \, |J_{H}|$$

Note: The Jacobian determinant satisfies $J_{H} = J_{H^{-1}}^{-1} = J_T^{-1}$.
The theorem for the transformation of bivariate random variables can be generalized to multiple random variables.

Given random variables $X_1, X_2, \ldots, X_n$ with a joint PDF $f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)$, and $(Y_1, Y_2, \ldots, Y_n) = T(X_1, X_2, \ldots, X_n)$, where $T: \mathbb{R}^n \rightarrow \mathbb{R}^n$ is a one-to-one transformation, let $H = T^{-1}$.

We define $J_H$, the Jacobian determinant of $H$, as

$$J_{H} = \left| \frac{\partial (H_1, H_2, \ldots, H_n)}{\partial (y_1, y_2, \ldots, y_n)} \right|
= \det\begin{bmatrix}
\frac{\partial H_1}{\partial y_1} & \frac{\partial H_1}{\partial y_2} & \ldots & \frac{\partial H_1}{\partial y_n} \\
\frac{\partial H_2}{\partial y_1} & \frac{\partial H_2}{\partial y_2} & \ldots & \frac{\partial H_2}{\partial y_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial H_n}{\partial y_1} & \frac{\partial H_n}{\partial y_2} & \ldots & \frac{\partial H_n}{\partial y_n}
\end{bmatrix}$$

Then the joint PDF of $(Y_1, Y_2, \ldots, Y_n)$ is

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = f_{X_1, \ldots, X_n}(H_1(y_1, \ldots, y_n), \ldots, H_n(y_1, \ldots, y_n)) \, |J_{H}|$$
Given a random variable $X$, the moment generating function (MGF) $M_X(t)$ of $X$ is defined as

$$M_X(t) = E[e^{tX}]$$

The domain of the MGF is the set of $t$ such that $M_X(t)$ exists and is finite. If the domain does not contain an open neighbourhood of $0$, then we say the MGF does not exist.
Given a random variable $X$ that follows the standard normal distribution,

$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$

By definition,

$$\begin{align}
M_X(t) &= E[e^{tX}] \\
&= \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx-\frac{x^2}{2}} \, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2tx)} \, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - t)^2 + \frac{t^2}{2}} \, dx \\
&= e^{\frac{t^2}{2}}
\end{align}$$

where the last step uses $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - t)^2} \, dx = 1$.
Given a random variable $X$ that follows the exponential distribution,

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

By definition, for $t < \lambda$,

$$\begin{align}
M_{X}(t) &= E[e^{tX}] \\
&= \int_{0}^{\infty} e^{tx} \lambda e^{-\lambda x} \, dx \\
&= \lambda \int_{0}^{\infty} e^{(t-\lambda)x} \, dx \\
&= \frac{\lambda}{\lambda - t} \int_{0}^{\infty} (\lambda - t) e^{-(\lambda - t)x} \, dx \\
&= \frac{\lambda}{\lambda - t}
\end{align}$$

where the last integral equals $1$ because the integrand is the PDF of an exponential distribution with rate $\lambda - t$.
The MGF satisfies $M_X(0) = E[1] = 1$.

The n-th derivative of the MGF at $t = 0$ is

$$M_X^{(n)}(0) = \left. E[X^{n}e^{tX}] \right|_{t=0} = E[X^{n}]$$

By the previous property, the Maclaurin series of the MGF is

$$M_X(t) = \sum_{n=0}^{\infty} \frac{E[X^{n}]}{n!} t^{n}$$

Also, if $X$ has MGF $M_X(t)$ and $Y = aX + b$, then $Y$ has MGF $M_Y(t) = e^{tb}M_X(at)$.

Given two random variables $X$ and $Y$ with MGFs $M_X(t)$ and $M_Y(t)$, if $M_X(t) = M_Y(t)$ for all $t$ in an open neighbourhood of $0$, then $X$ and $Y$ have the same distribution.
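The derivative property can be illustrated symbolically: differentiating the standard normal MGF $e^{t^2/2}$ at $t = 0$ recovers the moments $1, 0, 1, 0, 3, 0, 15, \ldots$ (a small sketch assuming SymPy is available):

```python
import sympy as sp

t = sp.symbols('t')
mgf = sp.exp(t**2 / 2)          # MGF of the standard normal, derived above

# The n-th derivative at t = 0 gives E[X^n].
moments = [sp.diff(mgf, t, n).subs(t, 0) for n in range(7)]
print(moments)                  # [1, 0, 1, 0, 3, 0, 15]
```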
The joint moment generating function (JMGF) of random variables $X_1, X_2, \ldots, X_n$ is defined as a function from $\mathbb{R}^n$ to $\mathbb{R}$:

$$M_\mathbf{X}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}]$$

where $\mathbf{t} = (t_1, t_2, \ldots, t_n)^T$ and $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$.

The JMGF is said to exist if it is finite on an open neighbourhood of $\mathbf{0}$. If the JMGF exists, it uniquely determines the joint distribution of $X_1, X_2, \ldots, X_n$.

The MGF of $X_i$ can be expressed as

$$M_{X_i}(t_i) = M_{\mathbf{X}}(0, \ldots, 0, t_i, 0, \ldots, 0)$$

The joint moments of $X_1, X_2, \ldots, X_n$ can be expressed as

$$E[X_1^{i_1}X_2^{i_2}\ldots X_n^{i_n}] = \left. \frac{\partial^{i_1 + i_2 + \ldots + i_n} M_{\mathbf{X}}(\mathbf{t})}{\partial t_1^{i_1} \partial t_2^{i_2} \ldots \partial t_n^{i_n}} \right|_{\mathbf{t} = \mathbf{0}}$$
Given random variables $X_1, X_2, \ldots, X_n$ with MGFs $M_{X_i}(t_i)$ and JMGF $M_{X_1, X_2, \ldots, X_n}(\mathbf{t})$, the variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if

$$M_{X_1, X_2, \ldots, X_n}(\mathbf{t}) = M_{X_1}(t_1)M_{X_2}(t_2)\ldots M_{X_n}(t_n)$$

Given independent random variables $X_1, X_2, \ldots, X_n$ and $S = a_1X_1 + a_2X_2 + \ldots + a_nX_n$, the MGF of $S$ is

$$M_S(t) = M_{X_1}(a_1t)M_{X_2}(a_2t)\ldots M_{X_n}(a_nt)$$
Given a random variable $X$ that takes non-negative integer values, the probability generating function (PGF) $\phi_X(z)$ of $X$ is defined as

$$\phi_X(z) = E[z^X] = \sum_{x=0}^{\infty} z^x P(X = x)$$

It satisfies $\phi_X(1) = 1$, and the PMF of $X$ is uniquely determined by $\phi_X(z)$.

The n-th factorial moment of $X$ is

$$E[X(X-1)\ldots(X-n+1)] = \left. \frac{d^n \phi_X(z)}{dz^n} \right|_{z=1}$$
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if the joint PGF factorizes:

$$\phi_{X_1, X_2, \ldots, X_n}(z_1, z_2, \ldots, z_n) = E[z_1^{X_1}z_2^{X_2}\ldots z_n^{X_n}] = \phi_{X_1}(z_1)\phi_{X_2}(z_2)\ldots \phi_{X_n}(z_n)$$

The PGF of a sum of independent random variables $X_1, X_2, \ldots, X_n$ is

$$\phi_{X_1 + X_2 + \ldots + X_n}(z) = \phi_{X_1}(z)\phi_{X_2}(z)\ldots \phi_{X_n}(z)$$
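A numerical sketch of the product rule for PGFs: for independent Poisson variables the product of PGFs, $e^{\lambda_1(z-1)} e^{\lambda_2(z-1)} = e^{(\lambda_1+\lambda_2)(z-1)}$, is again a Poisson PGF, so the PMF of the sum (computed by convolution) should match a $\text{Poisson}(\lambda_1+\lambda_2)$ PMF. The rates below are arbitrary; NumPy is assumed:

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(lam, kmax=60):
    return np.array([exp(-lam) * lam**k / factorial(k) for k in range(kmax)])

lam1, lam2 = 2.0, 3.5
p1, p2 = poisson_pmf(lam1), poisson_pmf(lam2)

# PMF of the sum of independent Poissons, via discrete convolution of the PMFs.
p_sum = np.convolve(p1, p2)[:60]
print(np.max(np.abs(p_sum - poisson_pmf(lam1 + lam2))))   # ~1e-16
```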
Given a random variable $X$ that takes non-negative integer values, with PGF $\phi_X(z)$ and MGF $M_X(t)$, the two are related by

$$\begin{align}
\phi_X(e^t) &= M_X(t) \\
M_X(\ln(z)) &= \phi_X(z), \quad z > 0
\end{align}$$
Markov's inequality states that, given a non-negative random variable $X$ and $a > 0$,

$$P(X \geq a) \leq \frac{E[X]}{a}$$

Proof:

$$\begin{align}
P(X \geq a) &= \int_a^{\infty} f_X(x) \, dx \\
&= \frac{1}{a} \int_a^{\infty} a f_X(x) \, dx \\
&\le \frac{1}{a} \int_a^{\infty} x f_X(x) \, dx \\
&\le \frac{1}{a} \int_0^{\infty} x f_X(x) \, dx \\
&= \frac{E[X]}{a}
\end{align}$$
Chebyshev's inequality states that, given a random variable $X$ with mean $\mu$ and variance $\sigma^2$, and $a > 0$,

$$P(|X - \mu| \geq a) \leq \frac{\sigma^2}{a^2}$$

Proof:

Define $Y = (X - \mu)^2$. Then $Y$ is a non-negative random variable, and $E[Y] = \text{Var}(X) = \sigma^2$.

By Markov's inequality,

$$P(Y \geq a^2) \leq \frac{E[Y]}{a^2} = \frac{\sigma^2}{a^2}$$

Then,

$$P(|X - \mu| \geq a) = P((X - \mu)^2 \geq a^2) = P(Y \geq a^2) \leq \frac{\sigma^2}{a^2}$$
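Both inequalities can be sanity-checked by simulation; the exponential example below (mean 1, variance 1) and the threshold are arbitrary, and NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=500_000)       # non-negative, mean 1, variance 1

a = 3.0
print(np.mean(x >= a), 1.0 / a)                    # Markov: ~0.0498 <= 1/3
print(np.mean(np.abs(x - 1.0) >= a), 1.0 / a**2)   # Chebyshev: ~0.0183 <= 1/9
```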
We define the higher dimensional normal distribution as an analogue of the one dimensional normal distribution.

We say a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$ follows a multivariate normal distribution if it can be expressed as

$$\mathbf{X} = \mathbf{\mu} + \mathbf{A}\mathbf{Z}$$

where $l \le n$, $\mathbf{\mu}$ is an $n \times 1$ vector of means, $\mathbf{A}$ is an $n \times l$ matrix of constants, and $\mathbf{Z}$ is an $l \times 1$ vector of independent standard normal random variables.

By convention, we write $\Sigma = A A^T$ and denote the multivariate normal distribution as

$$\mathbf{X} \sim N_n(\mathbf{\mu}, \Sigma)$$
If we assume that $\Sigma$ has full rank (so that $A$ can be taken to be an invertible $n \times n$ matrix), we can use the multivariate transformation theorem to derive the joint PDF of $\mathbf{X}$. Writing $z = A^{-1}(x-\mu)$,

$$\begin{align}
f_X(x) &= f_Z(A^{-1}(x-\mu))\,|\det(A^{-1})| \\
&= |\det(A^{-1})| \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z_i^2} \\
&= |\det(A^{-1})| \frac{1}{\sqrt{2\pi}^n} e^{-\frac{1}{2}z^Tz} \\
&= |\det(A^{-1})| \frac{1}{\sqrt{2\pi}^n} e^{-\frac{1}{2}(A^{-1}(x-\mu))^TA^{-1}(x-\mu)} \\
&= |\det(A^{-1})| \frac{1}{\sqrt{2\pi}^n} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}
\end{align}$$
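The defining representation $\mathbf{X} = \mu + A\mathbf{Z}$ also gives a direct way to sample from the multivariate normal distribution and to confirm that the sample covariance approaches $\Sigma = AA^T$ (a sketch with an arbitrary $\mu$ and $A$; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
sigma = A @ A.T                            # Σ = A Aᵀ

z = rng.standard_normal((200_000, 2))      # independent standard normal vectors, one per row
x = mu + z @ A.T                           # X = μ + A Z, applied row by row

print(x.mean(axis=0))                      # ~ μ
print(np.cov(x, rowvar=False))             # ~ Σ
print(sigma)
```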
If we assume that $\Sigma$ has full rank, the joint moment generating function of $\mathbf{X}$ is

$$\begin{align}
M_{X}(t) &= E[e^{t^TX}] \\
&= \int_{\mathbb{R}^n} e^{t^Tx} f_X(x) \, dx \\
&= \int_{\mathbb{R}^n} e^{t^Tx} |\det(A^{-1})| \frac{1}{\sqrt{2\pi}^n} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)} \, dx \\
&= \frac{|\det(A^{-1})|}{\sqrt{2\pi}^n} \int_{\mathbb{R}^n} e^{-\frac{1}{2}\left[
-2t^Tx
+x^T\Sigma^{-1}x
-x^T\Sigma^{-1}\mu
-\mu^T\Sigma^{-1}x
+\mu^T\Sigma^{-1}\mu
\right]} dx \\
&= \frac{|\det(A^{-1})|}{\sqrt{2\pi}^n} \int_{\mathbb{R}^n} e^{-\frac{1}{2}\left[
x^T\Sigma^{-1}x
-2(\Sigma t)^T \Sigma^{-1}x
-2\mu^T\Sigma^{-1}x
+\mu^T\Sigma^{-1}\mu
\right]} dx \\
&= \frac{|\det(A^{-1})|}{\sqrt{2\pi}^n} \int_{\mathbb{R}^n} e^{-\frac{1}{2}\left[
x^T\Sigma^{-1}x
-2(\mu+\Sigma t)^T\Sigma^{-1}x
+\mu^T\Sigma^{-1}\mu
\right]} dx \\
&= \frac{|\det(A^{-1})|}{\sqrt{2\pi}^n} e^{-\frac{1}{2}\left[
-(\mu+\Sigma t)^T\Sigma^{-1}(\mu+\Sigma t)
+\mu^T\Sigma^{-1}\mu
\right]} \int_{\mathbb{R}^n} e^{-\frac{1}{2}
(x-\mu-\Sigma t)^T\Sigma^{-1}(x-\mu-\Sigma t)
} dx \\
&= e^{-\frac{1}{2}\left[
-(\mu+\Sigma t)^T\Sigma^{-1}(\mu+\Sigma t)
+\mu^T\Sigma^{-1}\mu
\right]} \\
&= e^{t^T\mu + \frac{1}{2}t^T\Sigma t}
\end{align}$$

where the remaining integral equals $\sqrt{2\pi}^n \sqrt{\det\Sigma} = \sqrt{2\pi}^n / |\det(A^{-1})|$ (the normalizing constant of a multivariate normal density with covariance $\Sigma$), and the last equality uses the symmetry of $\Sigma$ to expand $(\mu+\Sigma t)^T\Sigma^{-1}(\mu+\Sigma t) = \mu^T\Sigma^{-1}\mu + 2t^T\mu + t^T\Sigma t$.
By the joint moment generating function of the multivariate normal distribution,

$$E[X_{1}^{k_1}X_{2}^{k_2}\ldots X_{n}^{k_n}] = \left. \frac{\partial^{k_1 + k_2 + \ldots + k_n} M_{X}(\mathbf{t})}{\partial t_1^{k_1} \partial t_2^{k_2} \ldots \partial t_n^{k_n}} \right|_{\mathbf{t} = \mathbf{0}}$$

In particular, since

$$\begin{align}
\frac{\partial}{\partial t_i} M_{X}(\mathbf{t}) &= \frac{\partial}{\partial t_i} e^{t^T\mu + \frac{1}{2}t^T\Sigma t} \\
&= \left[
\frac{\partial t}{\partial t_i}^T\mu + \frac{1}{2} \frac{\partial t}{\partial t_i}^T\Sigma t + \frac{1}{2} t^T\Sigma \frac{\partial t}{\partial t_i}
\right] M_{X}(\mathbf{t}) \\
&= e_i^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t})
\end{align}$$
and

$$\begin{align}
\frac{\partial^2 M_{X}(\mathbf{t})}{\partial t_j\partial t_i} &= \frac{\partial}{\partial t_j} e_i^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j M_{X}(\mathbf{t}) + e_i^T \left[ \mu + \Sigma t \right] e_j^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j M_{X}(\mathbf{t}) + e_i^T \left[ \mu + \Sigma t \right] \left[ \mu + \Sigma t \right]^T e_j M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j M_{X}(\mathbf{t}) + e_i^T \left[
\mu \mu^T
+ \mu (\Sigma t)^T
+ \Sigma t \mu^T
+ \Sigma t (\Sigma t)^T
\right] e_j M_{X}(\mathbf{t})
\end{align}$$
where $e_i$ is the $i$-th unit vector.

Then we can calculate the moments of the multivariate normal distribution:

$$E(X_i) = \frac{\partial}{\partial t_i} M_{X}(\mathbf{0}) = e_i^T \mu \, M_{X}(\mathbf{0}) = \mu_i$$

$$\begin{align}
E(X_iX_j) &= \frac{\partial^2 M_{X}}{\partial t_j\partial t_i}(\mathbf{0}) \\
&= e_i^T \Sigma e_j \, M_{X}(\mathbf{0}) + e_i^T \left[ \mu \mu^T \right] e_j \, M_{X}(\mathbf{0}) \\
&= \Sigma_{ij} + \mu_i\mu_j
\end{align}$$

And the covariance:

$$\text{Cov}(X_i, X_j) = E(X_iX_j) - E(X_i)E(X_j) = \Sigma_{ij}$$

Thus, the covariance matrix of $\mathbf{X}$ is $\Sigma$.
Taking $n=2$ in the multivariate normal distribution gives the bivariate normal distribution. The joint PDF of the bivariate normal distribution is

$$f_{X_1, X_2}(x_1, x_2) = \frac{|\det A^{-1}|}{2\pi} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$$

where $\mu = (\mu_1, \mu_2)^T$ and $\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}$.

By the moments of the multivariate normal distribution, $\Sigma$ can also be expressed as

$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) \\ \text{Cov}(X_1, X_2) & \text{Var}(X_2) \end{bmatrix}$$
Given $X$ following a multivariate normal distribution and $Y = AX+b$, then $Y$ also follows a multivariate normal distribution.

Given $X$ following a multivariate normal distribution, to get the marginal distribution of $Y = (X_{k_1},X_{k_2},\ldots,X_{k_i})$, we can let $A$ be an $i\times n$ matrix whose $j$-th row has a $1$ in column $k_j$ and zeros elsewhere, and let $b$ be an $i\times 1$ vector of zeros. Then $Y = AX + b$.

Given $X$ following a multivariate normal distribution, $X_i$ and $X_j$ are independent if and only if $\text{Cov}(X_i, X_j) = 0$.
Given $X$ following a multivariate normal distribution, if $\Sigma$ is a singular matrix, then $X$ follows a degenerate multivariate normal distribution. Suppose $x$ is an eigenvector of $\Sigma$ with eigenvalue $0$, and let $Y = x^T X$. Then the mean of $Y$ is $x^T \mu$ and the variance of $Y$ is $x^T \Sigma x = 0$, so $Y$ is almost surely a constant.
In this section, we assume $X_1, X_2, \ldots, X_n$ to be IID random variables with $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.

Let $S_{n} = \sum_{i=1}^{n}X_i$. Then,

$$E[S_{n}] = n\mu$$

$$\text{Var}(S_{n}) = n\sigma^2$$

$$E\left[\frac{S_{n}}{n}\right] = \mu$$

$$\text{Var}\left(\frac{S_{n}}{n}\right) = \frac{\sigma^2}{n}$$
Intuitively, as $n$ increases, the sample mean $\frac{S_{n}}{n}$ converges to the population mean $\mu$. In formal terms, the strong law of large numbers states that

$$P\left(\lim_{n\rightarrow\infty} \frac{S_{n}}{n} = \mu\right) = 1$$

and the weak law of large numbers states that, for all $\epsilon > 0$,

$$\lim_{n\rightarrow\infty} P\left(\left|\frac{S_{n}}{n} - \mu\right| \geq \epsilon\right) = 0$$
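A simulation sketch of the law of large numbers: sample means over growing prefixes of an IID exponential sample (mean 2, chosen arbitrarily) drift toward $\mu$ (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(scale=2.0, size=1_000_000)    # IID with mean μ = 2

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, x[:n].mean())                        # sample means approach 2 as n grows
```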
Given a random variable $X$ with CDF $F_X(x)$ and a sequence of random variables $X_1, X_2, \ldots$ with CDFs $F_{X_n}(x)$, we say that the sequence converges in distribution to $X$ if $F_{X_n}(x)$ converges pointwise to $F_X(x)$ at every point where $F_X$ is continuous.

As the MGF uniquely determines the distribution of a random variable, we have the following theorem:

Given a random variable $X$ with MGF $M_X(t)$ and a sequence of random variables $X_1, X_2, \ldots$ with MGFs $M_{X_n}(t)$, where all of the MGFs exist and are finite on a common open neighbourhood of $0$, if $M_{X_n}(t)$ converges pointwise to $M_X(t)$ on that open neighbourhood of $0$, then $X_1, X_2, \ldots$ converge in distribution to $X$.
The central limit theorem: given IID random variables $X_1, X_2, \ldots, X_n$ with mean $\mu$ and finite variance $\sigma^2$, and $S_{n} = \sum_{i=1}^{n}X_i$, the standardized sum $\frac{S_{n} - n\mu}{\sqrt{n}\sigma}$ converges in distribution to the standard normal distribution.
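A simulation sketch of the central limit theorem: standardized sums of IID $\text{Uniform}(0, 1)$ variables ($\mu = 1/2$, $\sigma^2 = 1/12$) should have an approximately standard normal CDF; the values of $n$ and the number of replications are arbitrary, and NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 200, 50_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)          # mean and standard deviation of Uniform(0, 1)

s_n = rng.random((reps, n)).sum(axis=1)       # reps independent copies of S_n
z = (s_n - n * mu) / (np.sqrt(n) * sigma)     # standardized sums

# Compare empirical probabilities with standard normal CDF values.
print(np.mean(z <= 0.0))    # ~0.5
print(np.mean(z <= 1.0))    # ~0.8413
print(np.mean(z <= 1.96))   # ~0.975
```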