
53_Bernoulli, Binomial, and Multinomial Distribution


In this post, we will discuss several basic probability distributions. The role of these distributions is to model the probability distribution $p(x)$ of a random variable $x$ when a set of observed data is given, a problem known as density estimation.

Before addressing Gaussian distributions for continuous variables, we will first discuss distributions for discrete variables.

 

Bernoulli Distribution

To give an example with coin flipping, $x=1$ represents heads and $x=0$ represents tails. The probability of getting heads when flipping the coin, that is, the probability of $x=1$, can be denoted by the parameter $\mu$ and expressed as follows.

$$p(x = 1|\mu ) = \mu $$

Since $0 \leqslant \mu  \leqslant 1$, it follows that $p(x = 0|\mu ) = 1 - \mu $. Therefore, the probability distribution for $x$ can be written as follows.

$${\rm{Bern}}(x|\mu ) = {\mu ^x}{(1 - \mu )^{1 - x}}$$

This is known as the Bernoulli distribution. It is normalized, and its mean and variance are as follows.

$$\mathbb{E}[x] = \mu $$

$$\operatorname{var} [x] = \mu (1 - \mu )$$

Given a dataset $D = \left\{ {{x_1}, \cdots ,{x_N}} \right\}$ consisting of observed values of $x$, assuming that the observations are independently drawn from $p(x|\mu )$, the likelihood function can be constructed as follows.

$$p(D|\mu ) = \prod\limits_{n = 1}^N {p({x_n}|\mu )}  = \prod\limits_{n = 1}^N {{\mu ^{{x_n}}}{{(1 - \mu )}^{1 - {x_n}}}} $$

The value of $\mu$ can be estimated by maximizing the log of the likelihood function. Since the log is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood itself, and it can be written as follows.

$$\ln p(D|\mu ) = \sum\limits_{n = 1}^N {\ln p({x_n}|\mu )}  = \sum\limits_{n = 1}^N {\left\{ {{x_n}\ln \mu  + (1 - {x_n})\ln (1 - \mu )} \right\}} $$

Setting the derivative with respect to $\mu$ to zero gives the maximum likelihood estimate, which is simply the sample mean of the binary observations.

$${\mu _{ML}} = \frac{1}{N}\sum\limits_{n = 1}^N {{x_n}}  = \frac{m}{N}$$

Here, $m$ is the number of observations of $x = 1$.
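To make this concrete, below is a minimal Python sketch (not from the original post; the sample size and the "true" value of $\mu$ are arbitrary assumptions) that simulates coin flips and checks that the closed-form estimate ${\mu _{ML}} = m/N$ agrees with a direct grid search over the log-likelihood.

```python
# A minimal sketch (assumptions: 1,000 simulated flips, true mu = 0.7) showing
# that the maximum likelihood estimate of mu is the fraction of heads.
import numpy as np

rng = np.random.default_rng(0)
true_mu = 0.7                                   # assumed "true" head probability
x = rng.binomial(n=1, p=true_mu, size=1000)     # simulated coin flips (1 = heads, 0 = tails)

def log_likelihood(mu, x):
    """Bernoulli log-likelihood: sum_n [ x_n ln(mu) + (1 - x_n) ln(1 - mu) ]."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

# Closed-form ML solution: mu_ML = m / N, i.e. the sample mean of the binary data.
mu_ml = x.mean()

# Numerical check: the grid point with the highest log-likelihood should agree.
grid = np.linspace(0.01, 0.99, 99)
mu_grid = grid[np.argmax([log_likelihood(g, x) for g in grid])]

print(mu_ml, mu_grid)   # both close to 0.7
```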

Binomial Distribution

When the size of the dataset is $N$, we can also consider the distribution of $m$, the number of observations of $x=1$. This is the binomial distribution: the probability that heads, i.e., $x=1$, occurs $m$ times in $N$ coin flips can be expressed as follows.

$${\rm{Bin}}(m|N,\mu ) = \binom{N}{m}{\mu ^m}{(1 - \mu )^{N - m}}$$

The normalization coefficient is the number of ways of choosing $m$ objects out of a total of $N$.

$$\binom{N}{m} = \frac{{N!}}{{(N - m)!\,m!}}$$

Since $m$ is the sum of $N$ independent observations of $x$, the mean and variance of the binomial distribution follow from summing the individual means and variances and can be represented as follows.

$$\mathbb{E}[m] = N\mu $$

$$\operatorname{var} [m] = N\mu (1 - \mu )$$
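As a quick sanity check (an assumed example, not part of the original post), the following sketch compares the analytic mean and variance with Monte Carlo estimates obtained by sampling counts of heads.

```python
# A small sanity check (assumed values N = 10, mu = 0.3): compare the analytic
# binomial mean N*mu and variance N*mu*(1 - mu) with Monte Carlo estimates.
import numpy as np

rng = np.random.default_rng(1)
N, mu = 10, 0.3

# m = number of heads in each of 100,000 simulated experiments of N flips.
m = rng.binomial(n=N, p=mu, size=100_000)

print("mean:", N * mu, "vs", m.mean())            # ~3.0
print("var :", N * mu * (1 - mu), "vs", m.var())  # ~2.1
```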

Multinomial Distribution

While a binary variable describes the probability of one of two possible values occurring, when a variable can take one of $K > 2$ mutually exclusive states, a multinomial distribution is used. Using the 1-of-$K$ representation, in which $x$ is a $K$-dimensional vector with exactly one element ${x_k}$ equal to 1 and all others equal to 0, the distribution can be represented as follows.

$$p(x|\mu ) = \prod\limits_{k = 1}^K {\mu _k^{{x_k}}} $$

Here, $\mu  = {\left( {{\mu _1}, \cdots ,{\mu _K}} \right)^T}$, and since its elements represent probabilities, they must satisfy the conditions ${\mu _k} \geqslant 0$ and $\sum\nolimits_k {{\mu _k} = 1} $. The above formula can be seen as a generalization of the Bernoulli distribution to more than two outcomes. It is normalized, and its expectation is as follows.

$$\sum\limits_x {p(x|\mu )}  = \sum\limits_{k = 1}^K {{\mu _k}}  = 1$$

$$\mathbb{E}[x|\mu ] = \sum\limits_x {p(x|\mu )\,x}  = {\left( {{\mu _1}, \cdots ,{\mu _K}} \right)^T} = \mu $$
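A brief sketch of the 1-of-$K$ representation may help here (the parameter values are assumptions for illustration): the product $\prod\nolimits_k {\mu _k^{{x_k}}} $ simply picks out the probability of the single active component.

```python
# A brief illustration (assumed parameter values) of the 1-of-K representation:
# p(x|mu) = prod_k mu_k^{x_k} picks out the probability of the active state.
import numpy as np

mu = np.array([0.2, 0.5, 0.3])   # assumed parameters: mu_k >= 0 and sum to 1
x = np.array([0, 1, 0])          # one-hot observation: state k = 2 was observed

p = np.prod(mu ** x)             # 0.2**0 * 0.5**1 * 0.3**0 = 0.5 = mu[1]
print(p)
```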

Considering a dataset $D$ with $N$ independent observations ${x_1}, \cdots ,{x_N}$, the likelihood function can be represented as follows.

$$p(D|\mu ) = \prod\limits_{n = 1}^N {\prod\limits_{k = 1}^K {\mu _k^{{x_{nk}}}} }  = \prod\limits_{k = 1}^K {\mu _k^{{m_k}}} $$

Here, ${m_k} = \sum\nolimits_n {{x_{nk}}} $ is the number of observations for which ${x_k} = 1$.

To find the maximum likelihood solution for $\mu$, we need to maximize $\ln p(D|\mu )$ with respect to ${\mu _k}$ while taking into account that the sum of all probabilities equals 1. For this, we use a Lagrange multiplier $\lambda $ and maximize the following quantity.

$$\sum\limits_{k = 1}^K {{m_k}\ln {\mu _k}}  + \lambda \left( {\sum\limits_{k = 1}^K {{\mu _k}}  - 1} \right)$$

Differentiating the above equation with respect to ${\mu _k}$ and setting the result to zero yields the following.

$${\mu _k} =  - \frac{{{m_k}}}{\lambda }$$

Substituting this result into the constraint $\sum\nolimits_k {{\mu _k} = 1} $ allows us to find the Lagrange multiplier $\lambda  =  - N$, and the maximum likelihood solution is as follows.

$$\mu _k^{ML} = \frac{{{m_k}}}{N}$$

This is simply the fraction of the $N$ observations for which ${x_k} = 1$.
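The following minimal sketch (with an assumed toy dataset in 1-of-$K$ form) shows that the maximum likelihood estimate is just the per-state fraction of observations.

```python
# A minimal sketch (assumed toy data) of the maximum likelihood solution
# mu_k = m_k / N: each parameter is the fraction of observations of state k.
import numpy as np

# N = 10 assumed observations over K = 3 states, in 1-of-K (one-hot) form.
X = np.array([
    [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0],
    [1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1],
])

m = X.sum(axis=0)    # per-state counts m_k, here [2, 5, 3]
N = X.shape[0]

mu_ml = m / N        # maximum likelihood estimate, here [0.2, 0.5, 0.3]
print(mu_ml)
```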

We can also consider the joint distribution of the counts ${m_1}, \cdots ,{m_K}$, conditioned on the parameter vector $\mu$ and on the total number of observations $N$, which can be expressed as follows.

$${\rm{Mult}}({m_1}, \cdots ,{m_K}|\mu ,N) = \binom{N}{{{m_1}{m_2} \cdots {m_K}}}\prod\limits_{k = 1}^K {\mu _k^{{m_k}}} $$

This is known as the multinomial distribution, and the normalization constant is the number of ways of partitioning $N$ objects into $K$ groups of sizes ${m_1}, \cdots ,{m_K}$, which can be represented as follows.

$$\binom{N}{{{m_1}{m_2} \cdots {m_K}}} = \frac{{N!}}{{{m_1}!{m_2}! \cdots {m_K}!}}$$

Note that the counts are subject to the constraint $\sum\nolimits_{k = 1}^K {{m_k}}  = N$.
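To illustrate, here is a short sketch (the counts and parameter values are assumptions) that computes the multinomial coefficient and the corresponding probability by hand and cross-checks them against scipy.stats.multinomial.

```python
# A short check (assumed counts and parameters) of the multinomial coefficient
# N! / (m_1! ... m_K!) and the resulting probability, compared against scipy.
from math import factorial

import numpy as np
from scipy.stats import multinomial

N = 10
m = [2, 5, 3]             # assumed counts, summing to N
mu = [0.2, 0.5, 0.3]      # assumed parameter vector

coef = factorial(N) // np.prod([factorial(mk) for mk in m])   # 2520
p_manual = coef * np.prod([p ** c for p, c in zip(mu, m)])

p_scipy = multinomial.pmf(m, n=N, p=mu)
print(coef, p_manual, p_scipy)   # the two probabilities should match
```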

 

 

 

ref : Chris Bishop's "Deep Learning - Foundations and Concepts"
