A random variable is a function that maps the outcomes of a random process (the sample space $\Omega$) to numerical values in $\mathbb{R}$. If $\Omega$ is discrete, the random variable is called a discrete random variable. If $\Omega$ is continuous, the random variable is called a continuous random variable.
The cumulative distribution function (CDF) of a random variable $X$ is defined as
$$F(x) = P(X \leq x)$$
The probability mass function (PMF) of a discrete random variable $X$ is defined as
$$p(x) = P(X = x)$$
The probability density function (PDF) of a continuous random variable $X$ is defined as
$$f(x) = \frac{dF(x)}{dx}$$
where $F(x)$ is the CDF of $X$.
The expectation of a random variable $X$ is defined as
$$E[X] = \sum_{x} x\, p(x) \quad \text{for discrete random variables}$$
where $p(x)$ is the PMF of $X$, and
$$E[X] = \int_{-\infty}^{\infty} x f(x)\, dx \quad \text{for continuous random variables}$$
where $f(x)$ is the PDF of $X$.
The variance of a random variable $X$ is defined as
$$\text{Var}(X) = E[(X - E[X])^2]$$
By this definition, the variance is always non-negative.
Alternatively, the variance can be calculated as
$$\text{Var}(X) = E[X^2] - E[X]^2$$
Expectation and variance satisfy the following properties:
$$E[aX + b] = aE[X] + b$$
$$E[aX + bY] = aE[X] + bE[Y]$$
$$\text{Var}(aX + b) = a^2 \text{Var}(X)$$
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
where $a$ and $b$ are constants, and $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
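As a quick numerical sanity check of these identities (not part of the original notes), the following Python sketch estimates both sides of each identity by Monte Carlo simulation, assuming NumPy is available; the distributions chosen for $X$ and $Y$ are arbitrary, and agreement holds only up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1_000_000)       # E[X] = 2, Var(X) = 4
Y = rng.normal(loc=1.0, scale=3.0, size=1_000_000)   # independent of X
a, b = 3.0, -1.0

# E[aX + b] = a E[X] + b
print(np.mean(a * X + b), a * np.mean(X) + b)

# Var(aX + b) = a^2 Var(X)
print(np.var(a * X + b), a**2 * np.var(X))

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y); here Cov(X, Y) is close to 0 since X and Y are independent
cov_xy = np.cov(X, Y)[0, 1]
print(np.var(X + Y), np.var(X) + np.var(Y) + 2 * cov_xy)
```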
The n-th moment of a random variable $X$ is defined as
$$E[X^n] = \sum_{x} x^n p(x) \quad \text{for discrete random variables}$$
The standard deviation of a random variable $X$ is defined as
$$\text{SD}(X) = \sqrt{\text{Var}(X)}$$
The Bernoulli distribution is a discrete distribution with two possible outcomes: 0 and 1. The PMF of a Bernoulli random variable $X$ is defined as
$$p(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$
where $p$ is the probability of success.
The expectation and variance of a Bernoulli random variable $X$ are
$$E[X] = p \quad \text{and} \quad \text{Var}(X) = p(1 - p)$$
The Binomial distribution is a discrete distribution that models the number of successes in a fixed number of independent Bernoulli trials. The PMF of a Binomial random variable $X$ is defined as
$$p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
where $n$ is the number of trials, $x$ is the number of successes, and $p$ is the probability of success.
The expectation and variance of a Binomial random variable $X$ are
$$E[X] = np \quad \text{and} \quad \text{Var}(X) = np(1 - p)$$
The Poisson distribution is a discrete distribution that models the number of events occurring in a fixed interval of time or space. The PMF of a Poisson random variable $X$ is defined as
$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
where $\lambda$ is the average rate of events.
The expectation and variance of a Poisson random variable $X$ are
$$E[X] = \lambda \quad \text{and} \quad \text{Var}(X) = \lambda$$
The Geometric distribution is a discrete distribution that models the number of trials needed to achieve the first success in a sequence of independent Bernoulli trials. The PMF of a Geometric random variable $X$ is defined as
$$p(x) = (1 - p)^{x - 1} p$$
where $x$ is the number of trials needed to achieve the first success, and $p$ is the probability of success.
The expectation and variance of a Geometric random variable $X$ are
$$E[X] = \frac{1}{p} \quad \text{and} \quad \text{Var}(X) = \frac{1 - p}{p^2}$$
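To tie the four discrete distributions above together, here is a small check using `scipy.stats` (a sketch with arbitrary illustrative parameter values; note that SciPy's `geom` uses the same "number of trials until the first success" convention as these notes):

```python
from scipy import stats

p, n, lam = 0.3, 10, 4.0
# (name, frozen scipy distribution, theoretical mean, theoretical variance)
cases = [
    ("Bernoulli", stats.bernoulli(p), p,       p * (1 - p)),
    ("Binomial",  stats.binom(n, p),  n * p,   n * p * (1 - p)),
    ("Poisson",   stats.poisson(lam), lam,     lam),
    ("Geometric", stats.geom(p),      1 / p,   (1 - p) / p**2),
]
for name, dist, mean, var in cases:
    m, v = dist.stats(moments="mv")
    print(f"{name:9s}  mean {float(m):.4f} vs {mean:.4f}   var {float(v):.4f} vs {var:.4f}")
```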
The Uniform distribution is a continuous distribution with a constant probability density function (PDF) over a fixed interval. The PDF of a Uniform random variable $X$ is defined as
$$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$
where $a$ and $b$ are the lower and upper bounds of the interval.
The expectation and variance of a Uniform random variable $X$ are
$$E[X] = \frac{a + b}{2} \quad \text{and} \quad \text{Var}(X) = \frac{(b - a)^2}{12}$$
The Exponential distribution is a continuous distribution that models the time between events in a Poisson process. The PDF of an Exponential random variable $X$ is defined as
$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$$
where $\lambda$ is the rate parameter.
The expectation and variance of an Exponential random variable $X$ are
$$E[X] = \frac{1}{\lambda} \quad \text{and} \quad \text{Var}(X) = \frac{1}{\lambda^2}$$
The Normal distribution is a continuous distribution that is symmetric and bell-shaped. The PDF of a Normal random variable $X$ is defined as
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
The expectation and variance of a Normal random variable $X$ are
$$E[X] = \mu \quad \text{and} \quad \text{Var}(X) = \sigma^2$$
We usually write $X \sim N(\mu, \sigma^2)$ to denote that $X$ follows a Normal distribution with mean $\mu$ and variance $\sigma^2$.
The Gamma distribution is a continuous distribution that generalizes the Exponential distribution. The PDF of a Gamma random variable $X$ is defined as
$$f(x) = \frac{\lambda^k x^{k - 1} e^{-\lambda x}}{\Gamma(k)}$$
where $\lambda$ is the rate parameter, $k$ is the shape parameter, and $\Gamma(k)$ is the gamma function:
$$\Gamma(k) = \int_{0}^{\infty} x^{k - 1} e^{-x}\, dx$$
The expectation and variance of a Gamma random variable $X$ are
$$E[X] = \frac{k}{\lambda} \quad \text{and} \quad \text{Var}(X) = \frac{k}{\lambda^2}$$
The Beta distribution is a continuous distribution that is defined on the interval $[0, 1]$. The PDF of a Beta random variable $X$ is defined as
$$f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$
where $\alpha$ and $\beta$ are the shape parameters, and $B(\alpha, \beta)$ is the beta function:
$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
The expectation and variance of a Beta random variable $X$ are
$$E[X] = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad \text{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
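Similarly, for the continuous families above, the stated means and variances can be checked by numerically integrating $\int x f(x)\,dx$ and $\int x^2 f(x)\,dx$ against the closed forms. A sketch assuming SciPy is available (parameter values are arbitrary; SciPy parameterizes the Exponential and Gamma by `scale` $= 1/\lambda$):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

a, b, lam, k, alpha, beta = 1.0, 5.0, 2.0, 3.0, 2.0, 5.0
cases = [
    ("Uniform",     stats.uniform(a, b - a),       (a + b) / 2,            (b - a)**2 / 12),
    ("Exponential", stats.expon(scale=1 / lam),    1 / lam,                1 / lam**2),
    ("Normal",      stats.norm(1.0, 2.0),          1.0,                    4.0),
    ("Gamma",       stats.gamma(k, scale=1 / lam), k / lam,                k / lam**2),
    ("Beta",        stats.beta(alpha, beta),       alpha / (alpha + beta),
     alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))),
]
for name, dist, mean, var in cases:
    lo, hi = dist.support()
    m, _  = quad(lambda x: x * dist.pdf(x), lo, hi)        # E[X]
    m2, _ = quad(lambda x: x**2 * dist.pdf(x), lo, hi)     # E[X^2]
    print(f"{name:11s}  E[X] {m:.4f} vs {mean:.4f}   Var {m2 - m**2:.4f} vs {var:.4f}")
```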
A bivariate distribution is a probability distribution that describes the joint behaviour of two random variables.
Given two random variables $X$ and $Y$, the joint probability mass function (PMF) for discrete random variables is defined as
$$p(x, y) = P(X = x, Y = y)$$
The marginal probability mass function (PMF) of a random variable $X$ is defined as
$$p_X(x) = \sum_{y} p(x, y)$$
The conditional probability mass function (PMF) of a random variable $X$ given $Y = y$ is defined as
$$p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)}$$
The expectation of a function $g(X, Y)$ under a bivariate distribution is defined as
$$E[g(X, Y)] = \sum_{x} \sum_{y} g(x, y)\, p(x, y)$$
The covariance of two random variables $X$ and $Y$ is defined as
$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$
Unlike variance, covariance can be negative, zero, or positive. It can also be calculated as
$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$$
The correlation coefficient of two random variables $X$ and $Y$ is defined as
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}$$
We can prove that $-1 \leq \rho(X, Y) \leq 1$.
Two random variables $X$ and $Y$ are independent if and only if
$$p(x, y) = p_X(x)\, p_Y(y)$$
for all $x$ and $y$.
If $X$ and $Y$ are independent, then
$$E[XY] = E[X]E[Y] \quad \text{and} \quad \text{Cov}(X, Y) = 0$$
Two random variables $X$ and $Y$ are uncorrelated if and only if
$$\text{Cov}(X, Y) = 0$$
If $X$ and $Y$ are uncorrelated, then
$$E[XY] = E[X]E[Y]$$
Note: Uncorrelated random variables are not necessarily independent.
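This note deserves a concrete example. The sketch below (my own illustrative choice, not from the notes) uses the standard counterexample $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$: the covariance is $0$, yet the joint PMF does not factor, so $X$ and $Y$ are uncorrelated but dependent.

```python
# X uniform on {-1, 0, 1} and Y = X^2: uncorrelated but not independent.
support = [-1, 0, 1]
pmf = {(x, x * x): 1 / 3 for x in support}            # joint PMF of (X, Y)

E = lambda g: sum(g(x, y) * p for (x, y), p in pmf.items())
EX, EY, EXY = E(lambda x, y: x), E(lambda x, y: y), E(lambda x, y: x * y)
print("Cov(X, Y) =", EXY - EX * EY)                   # 0.0, so X and Y are uncorrelated

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0), but 1/3 != 1/9.
pX0 = sum(p for (x, y), p in pmf.items() if x == 0)
pY0 = sum(p for (x, y), p in pmf.items() if y == 0)
print(pmf[(0, 0)], "vs", pX0 * pY0)
```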
Given two random variables $X$ and $Y$, the joint cumulative distribution function (CDF) for continuous random variables is defined as
$$F_{X,Y}(x, y) = P(X \leq x, Y \leq y)$$
The marginal cumulative distribution function (CDF) of a random variable $X$ is defined as
$$F_X(x) = P(X \leq x) = P(X \leq x, Y < \infty) = F_{X,Y}(x, \infty)$$
Given two random variables $X$ and $Y$, if there exists a function $f(x, y)$ such that
$$P((X, Y) \in A) = \iint_{A} f(x, y)\, dx\, dy$$
for all Lebesgue-measurable sets $A$, then $f(x, y)$ is the joint probability density function (PDF) of $X$ and $Y$, and $X$ and $Y$ are called jointly continuous random variables.
By the definition of the joint PDF, we have
$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\, dv\, du$$
and
$$f(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}$$
The marginal probability density function (PDF) of a random variable $X$ is defined as
$$f_X(x) = \frac{dF_X(x)}{dx} = \int_{-\infty}^{\infty} f(x, y)\, dy$$
Given $n$ random variables $X_1, X_2, \ldots, X_n$, let the vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. The joint cumulative distribution function (CDF) of $\mathbf{X}$ is defined as
$$F_\mathbf{X}(\mathbf{x}) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n)$$
If there exists a function $f(x_1, x_2, \ldots, x_n)$ such that
$$P(\mathbf{X} \in A) = \int_{A} f(\mathbf{x})\, d\mathbf{x}$$
for all Lebesgue-measurable sets $A$, then $f(\mathbf{x})$ is the joint probability density function (PDF) of $\mathbf{X}$, and $X_1, X_2, \ldots, X_n$ are called jointly continuous random variables.
By the definition of the joint PDF, we have
$$F_\mathbf{X}(\mathbf{x}) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \ldots \int_{-\infty}^{x_n} f(u_1, u_2, \ldots, u_n)\, du_n \ldots du_2\, du_1$$
and
$$f(\mathbf{x}) = \frac{\partial^n F_\mathbf{X}(\mathbf{x})}{\partial x_1 \partial x_2 \ldots \partial x_n}$$
The marginal probability density function (PDF) of a subset $X_{k_1}, X_{k_2}, \ldots, X_{k_m}$ of the variables is obtained by integrating the joint PDF over the remaining variables:
$$f_{X_{k_1}, X_{k_2}, \ldots, X_{k_m}}(x_{k_1}, x_{k_2}, \ldots, x_{k_m}) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n) \prod_{j \neq k_1, \ldots, k_m} dx_j$$
Two random variables $X$ and $Y$ are independent if and only if
$$F(x, y) = F_X(x) F_Y(y)$$
for all $x$ and $y$.
This can be thought of as saying that the joint behaviour of $X$ and $Y$ is the product of the marginal behaviour of $X$ and the marginal behaviour of $Y$.
The definition can also be formulated in terms of the joint PDF:
$$f(x, y) = f_X(x) f_Y(y)$$
To show that two random variables are not independent, we only need to find one pair of $x$ and $y$ for which the equation does not hold.
Given two independent random variables $X$ and $Y$, and functions $g(X)$ and $h(Y)$, the random variables $Z = g(X)$ and $W = h(Y)$ are also independent.
A set of random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if
$$F(x_1, x_2, \ldots, x_n) = F_{X_1}(x_1) F_{X_2}(x_2) \ldots F_{X_n}(x_n)$$
Note: Mutual independence implies pairwise independence; however, the converse is not true.
A set of random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed (IID) if and only if they are mutually independent and they all have the same distribution.
Given two independent random variables $X$ and $Y$, define their sum
$$Z = X + Y$$
Then the CDF of $Z$ can be calculated as
$$\begin{align}
F_Z(z) &= P(Z \leq z) \\
&= P(X + Y \leq z) \\
&= \int_{-\infty}^{\infty} P(X + Y \leq z \mid X = x) f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} P(Y \leq z - x) f_X(x)\, dx \\
&= \int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\, dx
\end{align}$$
Differentiating with respect to $z$, the PDF of $Z$ is
$$f_Z(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\, dx$$
This is called the convolution of the PDFs of $X$ and $Y$.
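As an illustration (my own example, not from the notes), the convolution of two independent $\text{Exp}(\lambda)$ densities should reproduce the $\text{Gamma}(k=2, \lambda)$ density; the sketch below, assuming SciPy, evaluates the convolution integral numerically at a few points:

```python
from scipy import stats
from scipy.integrate import quad

lam = 1.5
fX = stats.expon(scale=1 / lam).pdf
fY = stats.expon(scale=1 / lam).pdf
gamma2_pdf = stats.gamma(a=2, scale=1 / lam).pdf

for z in [0.5, 1.0, 2.0, 4.0]:
    # f_Z(z) = integral of f_Y(z - x) f_X(x) dx; the integrand vanishes outside [0, z]
    conv, _ = quad(lambda x: fY(z - x) * fX(x), 0, z)
    print(f"z = {z:.1f}   convolution {conv:.6f}   Gamma(2, rate={lam}) pdf {gamma2_pdf(z):.6f}")
```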
Let the vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ be a set of random variables. For a Lebesgue-measurable function $g: \mathbb{R}^n \rightarrow \mathbb{R}$, the expectation of $g(\mathbf{X})$ is defined as
$$E[g(\mathbf{X})] = \int_{\mathbb{R}^n} g(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}$$
Expectation is linear:
$$E[a g(\mathbf{X}) + b h(\mathbf{X}) + c] = a E[g(\mathbf{X})] + b E[h(\mathbf{X})] + c$$
If $X$ and $Y$ are independent, then $E[g(X)h(Y)] = E[g(X)]E[h(Y)]$ for any functions $g$ and $h$.
The covariance of two random variables $X$ and $Y$ is defined as
$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$
The covariance can be calculated as
$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$$
Covariance satisfies the following properties:
$$\text{Cov}(X, Y) = \text{Cov}(Y, X)$$
$$\text{Cov}(X, X) = \text{Var}(X)$$
$$\text{Cov}(aX + b, cY + d) = ac\, \text{Cov}(X, Y)$$
$$\text{Cov}\left(\sum_i a_i X_i, \sum_j b_j Y_j\right) = \sum_{i}\sum_{j} a_i b_j \text{Cov}(X_i, Y_j)$$
$$|\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X) \text{Var}(Y)}$$
Equality in the last inequality holds if and only if $X$ and $Y$ are linearly related.
The correlation coefficient of two random variables $X$ and $Y$ is defined as
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}$$
By the Cauchy-Schwarz Inequality, we have $-1 \leq \rho(X, Y) \leq 1$.
The n-th (raw) moment of a random variable $X$ is defined as $E[X^n]$, and the n-th central moment of a random variable $X$ is defined as $E[(X - E[X])^n]$.
The joint (raw) moment of random variables $X$ and $Y$ is defined as $E[X^i Y^j]$, and the joint central moment of random variables $X$ and $Y$ is defined as $E[(X - E[X])^i (Y - E[Y])^j]$.
Given a set of random variables $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$, its mean vector is
$$E[\mathbf{X}] = (E[X_1], E[X_2], \ldots, E[X_n])^T$$
which is an $n \times 1$ vector.
The covariance matrix of $\mathbf{X}$ is defined as
$$\text{Cov}(\mathbf{X}) = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T]$$
which is an $n \times n$ matrix.
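In practice the mean vector and covariance matrix are often estimated from samples; a brief NumPy sketch (the linear construction of the correlated samples below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((100_000, 3))                 # independent standard normals
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = Z @ A.T + np.array([1.0, -2.0, 0.5])              # rows are samples of the random vector

print("estimated E[X]:", X.mean(axis=0))              # close to (1, -2, 0.5)
print("estimated Cov(X):\n", np.cov(X, rowvar=False)) # close to A A^T for this construction
```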
Given two random variables $X$ and $Y$, the conditional probability mass function (PMF) of $X$ given $Y = y$ is defined as
$$p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)}$$
The conditional cumulative distribution function (CDF) of $X$ given $Y = y$ is defined as
$$F_{X|Y}(x|y) = P(X \leq x \mid Y = y)$$
The conditional probability density function (PDF) of $X$ given $Y = y$ is defined as
$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}$$
For jointly continuous random variables, the conditional CDF of $X$ given $Y = y$ can be written as
$$F_{X|Y}(x|y) = P(X \leq x \mid Y = y) = \int_{-\infty}^{x} f_{X|Y}(u|y)\, du$$
The conditional expectation of $X$ given $Y = y$ is defined as
$$E[X \mid Y = y] = \sum_{x} x\, p_{X|Y}(x|y) \quad \text{for discrete random variables}$$
$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x|y)\, dx \quad \text{for continuous random variables}$$
We can also define a function of $Y$ as
$$\psi(y) = E[X \mid Y = y]$$
Then $\psi(Y)$ is a random variable, and we call it the conditional expectation of $X$ given $Y$, written $E[X \mid Y]$.
Given two random variables $X$ and $Y$, the law of iterated expectations states that
$$E[E[X|Y]] = E[X]$$
Proof:
$$\begin{align}
E[E[X|Y]] &= \int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\, dy \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x f_{X|Y}(x|y)\, dx \right) f_Y(y)\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y)\, dx\, dy \\
&= E[X]
\end{align}$$
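A quick simulation check of the law of iterated expectations (the hierarchical model is my own illustrative choice, assuming NumPy): with $Y \sim \text{Exp}(1)$ and $X \mid Y = y \sim N(y, 1)$, we have $E[X \mid Y] = Y$, so $E[X]$ should match $E[Y]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
Y = rng.exponential(1.0, size=n)          # Y ~ Exp(1)
X = rng.normal(loc=Y, scale=1.0)          # X | Y = y ~ N(y, 1), so E[X | Y] = Y

print("E[X]      ~", X.mean())
print("E[E[X|Y]] ~", Y.mean())            # both should be close to 1
```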
Given a random variable $X$ and an event $A$, the law of total probability states that
$$P(A) = \int_{-\infty}^{\infty} P(A \mid X = x) f_X(x)\, dx$$
Given IID random variables $X_1, X_2, \ldots$ with the same distribution as $X$, and a stopping time $N$, which is a non-negative integer-valued random variable, then
$$E\left[\sum_{i=1}^{N}X_i\right] = E[X]E[N]$$
Conditional expectation satisfies the following properties:
$$E[aX + bY + c \mid Z] = aE[X \mid Z] + bE[Y \mid Z] + c$$
If $X \geq 0$, then $E[X \mid Y] \geq 0$.
If $X$ and $Y$ are independent, then $E[X \mid Y] = E[X]$.
For any functions $g$ and $h$, $E[g(X)h(Y) \mid Y] = h(Y)E[g(X) \mid Y]$.
Given two random variables $X$ and $Y$, the conditional variance of $X$ given $Y$ is defined as
$$\text{Var}(X|Y) = E[(X - E[X|Y])^2 \mid Y]$$
The conditional variance can be calculated as
$$\text{Var}(X|Y) = E[X^2|Y] - E[X|Y]^2$$
Note that the conditional variance is a function of $Y$, and hence itself a random variable.
Given two random variables $X$ and $Y$, the law of total variance states that
$$\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])$$
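The same kind of hierarchical simulation (again an illustrative choice) can be used to check the law of total variance: with $Y \sim \text{Exp}(1)$ and $X \mid Y = y \sim N(y, 4)$, we have $E[X \mid Y] = Y$ and $\text{Var}(X \mid Y) = 4$, so $\text{Var}(X)$ should be close to $4 + \text{Var}(Y) = 5$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
Y = rng.exponential(1.0, size=n)          # Var(Y) = 1
X = rng.normal(loc=Y, scale=2.0)          # Var(X | Y) = 4

print("Var(X)                    ~", X.var())
print("E[Var(X|Y)] + Var(E[X|Y]) ~", 4.0 + Y.var())
```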
Given a random variable $X$ with a PDF $f_X(x)$, the support of $f_X(x)$ is the set of values of $x$ where $f_X(x) > 0$.
Given a random variable $X$ with a PDF $f_X(x)$ and a function $Y = g(X)$, if $g$ is a monotonic function, then the CDF of $Y$ is
$$F_Y(y) = \begin{cases} F_X(g^{-1}(y)) & \text{if } g \text{ is increasing} \\ 1 - F_X(g^{-1}(y)) & \text{if } g \text{ is decreasing} \end{cases}$$
Then, the PDF of $Y$ is
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$$
For non-monotonic transformations, we can break the transformation into monotonic parts.
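As a worked instance of the monotone case (my own example), take $X \sim N(0,1)$ and $Y = g(X) = e^X$, so $g^{-1}(y) = \ln y$ and the formula gives $f_Y(y) = f_X(\ln y)/y$, which should agree with the lognormal density (assuming SciPy):

```python
import numpy as np
from scipy import stats

ys = np.array([0.3, 0.8, 1.0, 2.5, 5.0])
# Change-of-variables formula: f_Y(y) = f_X(g^{-1}(y)) |d/dy g^{-1}(y)| = f_X(ln y) / y
formula = stats.norm.pdf(np.log(ys)) / ys
reference = stats.lognorm.pdf(ys, s=1.0)   # lognormal with shape 1 is exp(N(0, 1))
print(formula)
print(reference)
```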
Given random variables $X_1$ and $X_2$ with a joint PDF $f_{X_1, X_2}(x_1, x_2)$, and $(Y_1, Y_2) = T(X_1, X_2)$, where $T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ is a one-to-one transformation, let $H = T^{-1}$.
We define $J_{H}$, the Jacobian determinant of $H$, as
$$J_{H} = \left| \frac{\partial (H_1, H_2)}{\partial (y_1, y_2)} \right| = \det\begin{bmatrix} \frac{\partial H_1}{\partial y_1} & \frac{\partial H_1}{\partial y_2} \\ \frac{\partial H_2}{\partial y_1} & \frac{\partial H_2}{\partial y_2} \end{bmatrix}$$
Then, the joint PDF of $(Y_1, Y_2)$ is
$$f_{Y_1, Y_2}(y_1, y_2) = f_{X_1, X_2}(H_1(y_1, y_2), H_2(y_1, y_2))\, |J_{H}|$$
Note: The Jacobian determinants satisfy $J_{H} = J_{H^{-1}}^{-1} = J_T^{-1}$.
The theorem for the transformation of bivariate random variables can be generalized to multiple random variables.
Given random variables $X_1, X_2, \ldots, X_n$ with a joint PDF $f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)$, and $(Y_1, Y_2, \ldots, Y_n) = T(X_1, X_2, \ldots, X_n)$, where $T: \mathbb{R}^n \rightarrow \mathbb{R}^n$ is a one-to-one transformation, let $H = T^{-1}$.
We define $J_{H}$, the Jacobian determinant of $H$, as
$$J_{H} = \left| \frac{\partial (H_1, H_2, \ldots, H_n)}{\partial (y_1, y_2, \ldots, y_n)} \right| = \det\begin{bmatrix} \frac{\partial H_1}{\partial y_1} & \frac{\partial H_1}{\partial y_2} & \ldots & \frac{\partial H_1}{\partial y_n} \\ \frac{\partial H_2}{\partial y_1} & \frac{\partial H_2}{\partial y_2} & \ldots & \frac{\partial H_2}{\partial y_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial H_n}{\partial y_1} & \frac{\partial H_n}{\partial y_2} & \ldots & \frac{\partial H_n}{\partial y_n} \end{bmatrix}$$
Then, the joint PDF of $(Y_1, Y_2, \ldots, Y_n)$ is
$$f_{Y_1, Y_2, \ldots, Y_n}(y_1, y_2, \ldots, y_n) = f_{X_1, X_2, \ldots, X_n}(H_1(y_1, \ldots, y_n), H_2(y_1, \ldots, y_n), \ldots, H_n(y_1, \ldots, y_n))\, |J_{H}|$$
Given a random variable $X$, the moment generating function (MGF) $M_X(t)$ of $X$ is defined as
$$M_X(t) = E[e^{tX}]$$
The domain of the MGF is the set of $t$ such that $M_X(t)$ exists and is finite.
If the domain does not contain an open neighbourhood of $0$, then we say the MGF does not exist.
Given a random variable $X$ that follows the standard normal distribution,
$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$
By definition,
$$\begin{align}
M_X(t) &= E[e^{tX}] \\
&= \int_{-\infty}^{\infty} e^{tx} f_X(x)\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx-\frac{x^2}{2}}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2tx)}\, dx \\
&= e^{\frac{t^2}{2}} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - t)^2}\, dx \\
&= e^{\frac{t^2}{2}}
\end{align}$$
where the last step uses the fact that the remaining integral is that of a normal PDF with mean $t$ and variance $1$, which equals $1$.
Given a random variable $X$ that follows the exponential distribution,
$$f_X(x) = \lambda e^{-\lambda x}$$
By definition, for $t < \lambda$,
$$\begin{align}
M_{X}(t) &= E[e^{tX}] \\
&= \int_{0}^{\infty} e^{tx} \lambda e^{-\lambda x}\, dx \\
&= \lambda \int_{0}^{\infty} e^{-(\lambda - t) x}\, dx \\
&= \frac{\lambda}{\lambda - t} \int_{0}^{\infty} (\lambda - t) e^{-(\lambda - t)x}\, dx \\
&= \frac{\lambda}{\lambda - t}
\end{align}$$
since the last integral is that of an $\text{Exp}(\lambda - t)$ PDF, which equals $1$.
The MGF satisfies $M_X(0) = E[1] = 1$.
The n-th derivative of the MGF at $t = 0$ gives the n-th moment: since $M_X^{(n)}(t) = E[X^{n}e^{tX}]$,
$$M_X^{(n)}(0) = E[X^{n}]$$
By the previous property, the Maclaurin series of the MGF is
$$M_X(t) = \sum_{n=0}^{\infty} \frac{E[X^{n}]}{n!} t^{n}$$
Also, if $X$ has MGF $M_X(t)$ and $Y = aX + b$, then $Y$ has MGF $M_Y(t) = e^{tb}M_X(at)$.
Given two random variables $X$ and $Y$ with MGFs $M_X(t)$ and $M_Y(t)$, if $M_X(t) = M_Y(t)$ for all $t$ in an open neighbourhood of $0$, then $X$ and $Y$ have the same distribution.
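As a small symbolic check of the moment property (assuming SymPy is available), differentiating the standard normal MGF $e^{t^2/2}$ derived earlier should give the moments $0, 1, 0, 3$ for $n = 1, \ldots, 4$:

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)                       # MGF of the standard normal

for n in range(1, 5):
    # M_X^{(n)}(0) = E[X^n]
    print(n, sp.diff(M, t, n).subs(t, 0))
```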
The joint moment generating function (JMGF) of random variables $X_1, X_2, \ldots, X_n$ is defined as a function from $\mathbb{R}^n$ to $\mathbb{R}$:
$$M_\mathbf{X}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}]$$
where $\mathbf{t} = (t_1, t_2, \ldots, t_n)^T$ and $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$.
If the JMGF exists and is finite on an open neighbourhood of $\mathbf{0}$, we say the JMGF exists, and it uniquely determines the joint distribution of $X_1, X_2, \ldots, X_n$.
The MGF of $X_i$ can be recovered as
$$M_{X_i}(t_i) = M_{\mathbf{X}}(0, \ldots, 0, t_i, 0, \ldots, 0)$$
The joint moments of $X_1, X_2, \ldots, X_n$ can be expressed as
$$E[X_1^{i_1}X_2^{i_2}\ldots X_n^{i_n}] = \left. \frac{\partial^{i_1 + i_2 + \ldots + i_n} M_{\mathbf{X}}(\mathbf{t})}{\partial t_1^{i_1} \partial t_2^{i_2} \ldots \partial t_n^{i_n}} \right|_{\mathbf{t} = \mathbf{0}}$$
Given random variables $X_1, X_2, \ldots, X_n$ with MGFs $M_{X_i}(t_i)$ and JMGF $M_{X_1, X_2, \ldots, X_n}(\mathbf{t})$, the variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if
$$M_{X_1, X_2, \ldots, X_n}(\mathbf{t}) = M_{X_1}(t_1)M_{X_2}(t_2)\ldots M_{X_n}(t_n)$$
Given independent random variables $X_1, X_2, \ldots, X_n$ and $S = a_1X_1 + a_2X_2 + \ldots + a_nX_n$, the MGF of $S$ is
$$M_S(t) = M_{X_1}(a_1t)M_{X_2}(a_2t)\ldots M_{X_n}(a_nt)$$
Given a random variable $X$ that takes non-negative integer values, the probability generating function (PGF) $\phi_X(z)$ of $X$ is defined as
$$\phi_X(z) = E[z^X] = \sum_{x=0}^{\infty} z^x P(X = x)$$
The PGF satisfies $\phi_X(1) = 1$, and the PMF of $X$ is uniquely determined by $\phi_X(z)$.
The n-th factorial moment of $X$ is
$$E[X(X-1)\ldots(X-n+1)] = \left. \frac{d^n \phi_X(z)}{dz^n} \right|_{z=1}$$
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if the joint PGF factors:
$$\phi_{X_1, X_2, \ldots, X_n}(z_1, z_2, \ldots, z_n) = E[z_1^{X_1}z_2^{X_2}\ldots z_n^{X_n}] = \phi_{X_1}(z_1)\phi_{X_2}(z_2)\ldots \phi_{X_n}(z_n)$$
The PGF of a sum of independent random variables $X_1, X_2, \ldots, X_n$ is
$$\phi_{X_1 + X_2 + \ldots + X_n}(z) = \phi_{X_1}(z)\phi_{X_2}(z)\ldots \phi_{X_n}(z)$$
Given a random variable $X$ that takes non-negative integer values, with PGF $\phi_X(z)$ and MGF $M_X(t)$, the two are related by
$$\begin{align}
\phi_X(e^t) &= M_X(t) \\
M_X(\ln z) &= \phi_X(z) \quad \text{for } z > 0
\end{align}$$
Given a non-negative random variable $X$ and $a > 0$, Markov's inequality states that
$$P(X \geq a) \leq \frac{E[X]}{a}$$
Proof:
$$\begin{align}
P(X \geq a) &= \int_a^{\infty} f_X(x)\, dx \\
&= \frac{1}{a} \int_a^{\infty} a f_X(x)\, dx \\
&\le \frac{1}{a} \int_a^{\infty} x f_X(x)\, dx \\
&\le \frac{1}{a} \int_0^{\infty} x f_X(x)\, dx \\
&= \frac{E[X]}{a}
\end{align}$$
Given a random variable $X$ with mean $\mu$ and variance $\sigma^2$, and $a > 0$, Chebyshev's inequality states that
$$P(|X - \mu| \geq a) \leq \frac{\sigma^2}{a^2}$$
Proof:
Define $Y = (X - \mu)^2$. Then $Y$ is a non-negative random variable, and $E[Y] = \text{Var}(X) = \sigma^2$.
By Markov's inequality,
$$P(Y \geq a^2) \leq \frac{E[Y]}{a^2} = \frac{\sigma^2}{a^2}$$
Then,
$$P(|X - \mu| \geq a) = P((X - \mu)^2 \geq a^2) = P(Y \geq a^2) \leq \frac{\sigma^2}{a^2}$$
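Both inequalities are easy to check empirically; a Monte Carlo sketch with an $\text{Exp}(1)$ sample (an arbitrary non-negative choice, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1, Var(X) = 1
a = 3.0

# Markov: P(X >= a) <= E[X] / a
print(np.mean(X >= a), "<=", X.mean() / a)

# Chebyshev: P(|X - mu| >= a) <= sigma^2 / a^2
mu, sigma2 = X.mean(), X.var()
print(np.mean(np.abs(X - mu) >= a), "<=", sigma2 / a**2)
```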
We define the higher-dimensional normal distribution as an analogue of the one-dimensional normal distribution.
We say a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$ follows a multivariate normal distribution if it can be expressed as
$$\mathbf{X} = \mathbf{\mu} + \mathbf{A}\mathbf{Z}$$
where $l \le n$, $\mathbf{\mu}$ is an $n \times 1$ vector of means, $\mathbf{A}$ is an $n \times l$ matrix of constants, and $\mathbf{Z}$ is an $l \times 1$ vector of independent standard normal random variables.
By convention, we write $\Sigma = A A^T$, and we denote the multivariate normal distribution as
$$\mathbf{X} \sim N_n(\mathbf{\mu}, \Sigma)$$
If we assume that $\Sigma$ has full rank (so that $A$ can be taken to be an invertible $n \times n$ matrix), we can use the multivariate transformation theorem to derive the joint PDF of $\mathbf{X}$:
$$\begin{align}
f_X(x) &= f_Z(A^{-1}(x-\mu))\,|\det(A^{-1})| \\
&= |\det(A^{-1})| \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z_i^2} \\
&= |\det(A^{-1})| \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}z^Tz} \\
&= |\det(A^{-1})| \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}(A^{-1}(x-\mu))^T A^{-1}(x-\mu)} \\
&= |\det(A^{-1})| \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}
\end{align}$$
where $z = A^{-1}(x - \mu)$ and $|\det(A^{-1})| = \det(\Sigma)^{-1/2}$.
If we assume that $\Sigma$ has full rank, the joint moment generating function of $\mathbf{X}$ is
$$\begin{align}
M_{X}(t) &= E[e^{t^TX}] \\
&= \int_{\mathbb{R}^n} e^{t^Tx} f_X(x)\, dx \\
&= \int_{\mathbb{R}^n} e^{t^Tx} |\det(A^{-1})| \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}\, dx \\
&= \frac{|\det(A^{-1})|}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{-\frac{1}{2}\left[ -2t^Tx + x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu - \mu^T\Sigma^{-1}x + \mu^T\Sigma^{-1}\mu \right]}\, dx \\
&= \frac{|\det(A^{-1})|}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{-\frac{1}{2}\left[ x^T\Sigma^{-1}x - 2(\mu + \Sigma t)^T\Sigma^{-1}x + \mu^T\Sigma^{-1}\mu \right]}\, dx \\
&= \frac{|\det(A^{-1})|}{(2\pi)^{n/2}} e^{-\frac{1}{2}\left[ -(\mu+\Sigma t)^T\Sigma^{-1}(\mu+\Sigma t) + \mu^T\Sigma^{-1}\mu \right]} \int_{\mathbb{R}^n} e^{-\frac{1}{2} (x-\mu-\Sigma t)^T\Sigma^{-1}(x-\mu-\Sigma t)}\, dx \\
&= e^{-\frac{1}{2}\left[ -(\mu+\Sigma t)^T\Sigma^{-1}(\mu+\Sigma t) + \mu^T\Sigma^{-1}\mu \right]} \\
&= e^{t^T\mu + \frac{1}{2}t^T\Sigma t}
\end{align}$$
Here we used $t^Tx = (\Sigma t)^T\Sigma^{-1}x$ and $x^T\Sigma^{-1}\mu = \mu^T\Sigma^{-1}x$ (since $\Sigma$ is symmetric), completed the square in $x$, and recognized that the remaining integral is that of an $N_n(\mu + \Sigma t, \Sigma)$ density scaled by $(2\pi)^{n/2}\det(\Sigma)^{1/2} = (2\pi)^{n/2}/|\det(A^{-1})|$, which cancels the prefactor.
By the joint moment generating function of the multivariate normal distribution,
$$E[X_{1}^{k_1}X_{2}^{k_2}\ldots X_{n}^{k_n}] = \left. \frac{\partial^{k_1 + k_2 + \ldots + k_n} M_{X}(\mathbf{t})}{\partial t_1^{k_1} \partial t_2^{k_2} \ldots \partial t_n^{k_n}} \right|_{\mathbf{t} = \mathbf{0}}$$
In particular,
$$\begin{align}
\frac{\partial}{\partial t_i} M_{X}(\mathbf{t}) &= \frac{\partial}{\partial t_i} e^{t^T\mu + \frac{1}{2}t^T\Sigma t} \\
&= \left[ \frac{\partial t}{\partial t_i}^T\mu + \frac{1}{2}\left( \frac{\partial t}{\partial t_i}^T\Sigma t + t^T\Sigma \frac{\partial t}{\partial t_i} \right) \right] M_{X}(\mathbf{t}) \\
&= e_i^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t})
\end{align}$$
and
$$\begin{align}
\frac{\partial^2}{\partial t_j\partial t_i} M_{X}(\mathbf{t}) &= \frac{\partial}{\partial t_j} e_i^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j\, M_{X}(\mathbf{t}) + e_i^T \left[ \mu + \Sigma t \right] e_j^T \left[ \mu + \Sigma t \right] M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j\, M_{X}(\mathbf{t}) + e_i^T \left[ \mu + \Sigma t \right] \left[ \mu + \Sigma t \right]^T e_j\, M_{X}(\mathbf{t}) \\
&= e_i^T \Sigma e_j\, M_{X}(\mathbf{t}) + e_i^T \left[ \mu \mu^T + \mu (\Sigma t)^T + \Sigma t \mu^T + \Sigma t (\Sigma t)^T \right] e_j\, M_{X}(\mathbf{t})
\end{align}$$
where $e_i$ is the $i$-th unit vector.
Then, we can calculate the moments of the multivariate normal distribution.
$$E(X_i) = \left.\frac{\partial}{\partial t_i} M_{X}(\mathbf{t})\right|_{\mathbf{t}=0} = e_i^T \mu\, M_{X}(0) = \mu_i$$
$$\begin{align}
E(X_iX_j) &= \left.\frac{\partial^2}{\partial t_j\partial t_i} M_{X}(\mathbf{t})\right|_{\mathbf{t}=0} \\
&= e_i^T \Sigma e_j\, M_{X}(0) + e_i^T \mu \mu^T e_j\, M_{X}(0) \\
&= \Sigma_{ij} + \mu_i\mu_j
\end{align}$$
And the covariance:
$$\text{Cov}(X_i, X_j) = E(X_iX_j) - E(X_i)E(X_j) = \Sigma_{ij}$$
Thus, the covariance matrix of $\mathbf{X}$ is $\Sigma$.
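A short numerical check of this construction and of the derived density (the particular $\mu$ and $A$ below are arbitrary illustrative choices, assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
Sigma = A @ A.T

# Construction X = mu + A Z with Z a vector of independent standard normals.
Z = rng.standard_normal((500_000, 2))
X = mu + Z @ A.T
print("empirical mean:", X.mean(axis=0))             # close to mu
print("empirical cov:\n", np.cov(X, rowvar=False))   # close to Sigma

# The derived PDF (here n = 2 and |det A^{-1}| = det(Sigma)^{-1/2}) vs SciPy's multivariate normal.
x = np.array([0.5, 0.0])
quad_form = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
pdf_formula = np.exp(-0.5 * quad_form) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
print(pdf_formula, stats.multivariate_normal(mu, Sigma).pdf(x))
```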
Given $n=2$ in the multivariate normal distribution, we have the bivariate normal distribution.
The joint PDF of the bivariate normal distribution is
$$f_{X_1, X_2}(x_1, x_2) = \frac{|\det A^{-1}|}{2\pi} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$$
where $\mu = (\mu_1, \mu_2)^T$ and $\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}$.
By the moments of the multivariate normal distribution, $\Sigma$ can also be expressed as
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) \\ \text{Cov}(X_1, X_2) & \text{Var}(X_2) \end{bmatrix}$$
Given $\mathbf{X}$ following a multivariate normal distribution and $Y = A\mathbf{X} + b$, then $Y$ also follows a multivariate normal distribution.
To get the marginal distribution of $Y = (X_{k_1}, X_{k_2}, \ldots, X_{k_i})$, we can let $A$ be the $i \times n$ matrix whose $j$-th row has a $1$ in column $k_j$ and zeros elsewhere, and $b$ be an $i \times 1$ vector of zeros. Then $Y = A\mathbf{X} + b$, so the marginal distribution is again multivariate normal.
Given $\mathbf{X}$ following a multivariate normal distribution, $X_i$ and $X_j$ are independent if and only if $\text{Cov}(X_i, X_j) = 0$.
Given $\mathbf{X}$ following a multivariate normal distribution, if $\Sigma$ is a singular matrix, then $\mathbf{X}$ follows a degenerate multivariate normal distribution.
Suppose $x$ is an eigenvector of $\Sigma$ with eigenvalue $0$, and let $Y = x^T \mathbf{X}$. Then the mean of $Y$ is $x^T \mu$ and the variance of $Y$ is $x^T \Sigma x = 0$, so $Y$ is a constant (a degenerate random variable).
In this section, we assume $X_1, X_2, \ldots, X_n$ to be IID random variables with the same distribution as $X$, and let $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$.
Let $S_{n} = \sum_{i=1}^{n}X_i$.
Then,
$$E[S_{n}] = n\mu$$
$$\text{Var}(S_{n}) = n\sigma^2$$
$$E\left[\frac{S_{n}}{n}\right] = \mu$$
$$\text{Var}\left(\frac{S_{n}}{n}\right) = \frac{\sigma^2}{n}$$
By intuition, as $n$ increases, the sample mean $\frac{S_{n}}{n}$ converges to the population mean $\mu$. In formal terms, the strong law of large numbers states that
$$P\left(\lim_{n\rightarrow\infty} \frac{S_{n}}{n} = \mu\right) = 1$$
and the weak law of large numbers states that, for all $\epsilon > 0$,
$$\lim_{n\rightarrow\infty} P\left(\left|\frac{S_{n}}{n} - \mu\right| \geq \epsilon\right) = 0$$
Note that the strong law implies the weak law, but the two statements are not equivalent.
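A quick simulation of the law of large numbers (Bernoulli(0.5) summands, an arbitrary choice, assuming NumPy): the running sample mean settles around $\mu$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 0.5
X = rng.binomial(1, mu, size=1_000_000)                    # IID Bernoulli(0.5)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)     # S_n / n for every n

for n in [10, 1_000, 100_000, 1_000_000]:
    print(f"n = {n:>9d}   S_n / n = {running_mean[n - 1]:.4f}")
```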
Given a random variable $X$ with CDF $F_X(x)$ and a sequence of random variables $X_1, X_2, \ldots$ with CDFs $F_{X_n}(x)$, we say that the sequence converges in distribution to $X$ if $F_{X_n}(x)$ converges pointwise to $F_X(x)$ at every point where $F_X$ is continuous.
As the MGF uniquely determines the distribution of a random variable, we have the following theorem:
Given a random variable $X$ with MGF $M_X(t)$ and a sequence of random variables $X_1, X_2, \ldots$ with MGFs $M_{X_n}(t)$, where all of the MGFs exist and are finite on a common open neighbourhood of $0$, if $M_{X_n}(t)$ converges pointwise to $M_X(t)$ on that open neighbourhood of $0$, then $X_1, X_2, \ldots$ converge in distribution to $X$.
The central limit theorem: given IID random variables $X_1, X_2, \ldots$ with mean $\mu$ and variance $\sigma^2$, and $S_{n} = \sum_{i=1}^{n}X_i$, the standardized sum $\frac{S_{n} - n\mu}{\sqrt{n}\sigma}$ converges in distribution to the standard normal distribution.
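A simulation sketch of the central limit theorem (Exp(1) summands, an arbitrary choice, assuming NumPy and SciPy): standardized sums over many independent replications should have CDF values close to the standard normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 1_000, 10_000
mu, sigma = 1.0, 1.0                                  # mean and sd of an Exp(1) summand

S = rng.exponential(1.0, size=(reps, n)).sum(axis=1)  # reps independent copies of S_n
Z = (S - n * mu) / (np.sqrt(n) * sigma)               # standardized sums

for z in [-1.0, 0.0, 1.0, 2.0]:
    print(f"P(Z <= {z:+.1f})   empirical {np.mean(Z <= z):.4f}   normal {stats.norm.cdf(z):.4f}")
```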